Abstract

A malleable coding scheme considers not only compression efficiency but also the ease of alteration, 
thus encouraging some form of recycling of an old compressed version in the formation of a new one. 
Malleability cost is the difficulty of synchronizing compressed versions, and malleable codes are of 
particular interest when representing information and modifying the representation are both expensive. 
We examine the trade-off between compression efficiency and malleability cost under a malleability 
metric defined with respect to a string edit distance. This problem introduces a metric topology to the 
compressed domain. We characterize the achievable rates and malleability as the solution of a subgraph 
isomorphism problem. This can be used to argue that allowing conditional entropy of the edited message 
given the original message to grow linearly with block length creates an exponential increase in code 
length. 
I. Introduction

The source coding theorem for block codes is obtained by calculating the number of typical source 
sequences and generating a set of labels to enumerate them. Asymptotically almost surely (a.a.s.), only
typical sequences will occur, so it is sufficient that the set of labels be as large as the set of typical
sequences; this yields the achievable entropy bound. As Shannon comments,¹ "The high probability
group is coded in an arbitrary one-to-one way into this set," and so in this sense there is no notion of
topology of typical sequences.

If one is concerned with zero error rather than a.a.s. negligible error, the source coding theorem for 
variable-length codes also yields the entropy as an achievable lower bound. In this setting, the mapping 
from source sequences to labels is not allowed to be quite as arbitrary; however, as long as an optimizing 
set of code lengths is correctly matched to source letters, there are still some arbitrary choices in an 
optimal construction [2].

In contrast to these well-known settings, we investigate the mapping from the source to its compressed 
representation motivated by the following problem. Suppose that after compressing a source X^n, it is
modified to become Y^n according to a memoryless editing process p_{Y|X}. A malleable coding scheme
preserves some portion of the codeword of X^n and modifies the remainder into a new codeword from
which Y^n may be decoded reliably.

There are several ways to define how one preserves some portion of the codeword of X^n. Here we
concentrate on a malleability cost defined by a normalized edit distance in the compressed domain.
This is motivated by systems where the old codeword is stored in a rewritable medium; cost is incurred
when a symbol has to be changed in value, regardless of the location. Recalling the ancient practice of
scraping and overwriting parchment [3], we call the storage medium a compressed palimpsest and the
characterization of the trade-offs the palimpsest problem.

A companion paper [4] focuses on a distinct problem with a similar motivation. There, we fix a part 
of the old codeword to be recycled in creating a codeword for Y^n. Without loss of generality, the fixed
portion can be taken to be the beginning of the codeword, so the new codeword is a fixed prefix followed 
by a new suffix. This formulation is suitable for applications in which the update information (new suffix) 
must be transmitted through a communication channel. If the locations of the changed symbols were to 
be arbitrary, one would need to assign a cost to the indexing of the locations. 

The main result for the palimpsest problem is a graphical characterization of achievable rates and
number of editing operations. The result involves the solution to the error-tolerant attributed subgraph
isomorphism problem [5], which is essentially a graph embedding problem. Although graph functionals
such as independence number [6] and chromatic number [7]² often arise in the solution of information
theory problems, this seems to be the first time that the subgraph isomorphism problem has arisen.
Moreover, this seems to be the first treatment of the source code as a mapping between metric spaces.

¹From [1], with emphasis added.

Several of the results we obtain are pessimistic. Unless the old source and the new source are very 
strongly correlated, a large rate penalty must be paid in order to have minimal malleability cost. Similarly, 
a large malleability cost must be incurred if the rates are required to be close to entropy. 

Outline and Preview: The remainder of the paper is organized as follows. In Section II we present a
few toy examples of coding methods that exhibit a large range of possible trade-offs. Section III provides
additional motivation and context for our work. Section IV then provides a formal problem statement,
and constructive coding techniques paralleling those previewed in Section II are developed precisely in
Section V.

In Section VI, graph embedding techniques are used to specify achievable rate-malleability points. In
particular, Section VI-A deals with Hamming distance as the editing cost and proposes a construction
using Gray codes. Lower bounds and constructive examples using letter-by-letter encoding and decoding
are given. This graph embedding approach is generalized in Section VI-C to include other edit distances
via generalized minimal change codes.

While the above delay-free encoding and decoding gives optimal results for a few special cases, we
consider a more general coding approach in Section VII, considering both variable-length and block
codes. In the latter case, we show that the topology of typical sequences plays an important role in our
problem. Using graph-theoretic ideas, we give an achievability result in Theorem 2. Further, in Theorem 3
we argue that a linear reduction in malleability is at exponential cost in compression efficiency, consistent
with the examples given in Section II. This theorem is proved for "stationary editing distributions," though
we believe it to be true for general distributions. In Theorem 4 we give an upper bound on malleability
cost using the Lipschitz constant of the source code mapping for general distributions.

Section VIII provides some final observations on the trade-off between malleability cost and compression
efficiency, gives some conclusions, and discusses future work.

²The chromatic number of a graph can be related to its genus (which is defined by the topological embedding of the graph into
closed, oriented surfaces [8], [9]); however, our interest is in metric graph embedding rather than topological graph embedding.
Fig. 1. Qualitative representation of the four simple techniques of Section II. For ease of representation, it is assumed that
H(X) = H(Y). The relative orderings of points are based on H(Z) ≪ H(X); this reflects the natural case where the editing
operation is of low complexity relative to the original string.



II. Simple Examples

To motivate this exposition prior to defining all quantities precisely, we begin by giving four examples
of how one can trade off between compression efficiency and malleability. Let X, Y, and Z be binary
variables with entropies H(X), H(Y), and H(Z), respectively. Suppose that the original observation is
a word X^n. After compressing X^n, the original source is modified by adding a binary sequence Z^n with
Hamming weight np to obtain a new word Y^n = X^n ⊕ Z^n. Suppose the storage alphabet is also binary
and that the cost of synchronization is measured with the extended Hamming distance. Unlike many
source coding problems where only the cardinality of the set of codewords is used, here the alphabet
itself is used to measure malleability cost; an abstract set of indices is not appropriate.

How might the code for X^n and the update mechanism to allow representation of Y^n be designed?
The four possibilities below are summarized in Fig. 1.

a) No compression: We store n bits for X^n. Hence synchronizing to the new version only requires
changing the same number of bits in the code as were changed from X^n to Y^n; the cost is the Hamming
weight of Z^n, np.

b) Fully compress X^n and Y^n: We apply Shannon-type compression, storing only nH(X) bits
for X^n. It seems, however, that a large portion of this old codeword will have to be changed, perhaps
about half the bits, to become a representation for Y^n. Compression efficiency is obtained at the cost
of malleability.

c) Fully compress X^n and an increment: Another coding strategy is to compress the change Z^n
separately and append it to the original compression of X^n. The new compression then has length
n(H(X) + H(Z)) ≥ nH(Y) bits. The extended Hamming malleability cost is nH(Z) bits.

d) Completely favor malleability over compression: Interestingly, there is a method that dramatically
trades compression efficiency for malleability.³ The source X^n is encoded with 2^{nH(X)} bits, using an
indicator function to denote which of its typical sequences was observed. The same strategy is used to
encode Y^n, using 2^{nH(Y)} bits. Then synchronization requires changing only two bits when X^n and
Y^n are different.

Our purpose is to study the limits of this interesting trade-off between compression efficiency and 
malleability. We will do so using formalized performance metrics after a bit more background. 
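The trade-offs in examples a), c), and d) can be tabulated numerically. The following is a minimal sketch, not part of the original development: it assumes that X and Z are independent Bernoulli variables with parameters `px` and `pz` (so that p = `pz`), it reports per-source-letter values in bits, and the function names are illustrative. Example b) is omitted because the text gives only a rough estimate of its cost.

```python
import math


def binary_entropy(q: float) -> float:
    """Entropy in bits of a Bernoulli(q) variable."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)


def xor_entropy(px: float, pz: float) -> float:
    """Entropy of Y = X xor Z, assuming X and Z are independent."""
    py = px * (1 - pz) + (1 - px) * pz
    return binary_entropy(py)


def simple_schemes(n: int, px: float, pz: float) -> dict:
    """Per-letter (old rate, new rate, malleability cost) for schemes a, c, d."""
    hx = binary_entropy(px)
    hz = binary_entropy(pz)
    hy = xor_entropy(px, pz)
    return {
        # a) store the raw bits; about n*pz bits change
        "a_no_compression": (1.0, 1.0, pz),
        # c) append a separately compressed increment of n*H(Z) bits
        "c_increment": (hx, hx + hz, hz),
        # d) indicator vectors over typical sequences; two bits change
        "d_indicator": (2 ** (n * hx) / n, 2 ** (n * hy) / n, 2 / n),
    }


if __name__ == "__main__":
    for name, triple in simple_schemes(n=20, px=0.5, pz=0.1).items():
        print(name, triple)
```

Even at n = 20 the indicator scheme d) already needs tens of thousands of bits per source letter, which illustrates the exponential rate penalty paid for its very small update cost.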

III. Background 

Our study of malleable compression is motivated by information storage systems that store documents 
which are updated often. In such systems, the storage costs include not only the average length of the 
coded signal, but also the costs in updating. We describe these systems and also discuss an information 
storage system in synthetic biology, where the editing costs are much more significant and restrictive 
than in optical or magnetic systems. 

A. Version Management 

Consider the installation of a security patch to an operating system, the update of a text document after 
proofreading, the storage of a computer file backup system after a day's work, or a second email that 
corrects the location of a seminar yet also reproduces the entire seminar abstract. In all of these settings and 
numerous others, separate data streams may be generated, but the contents differ only slightly [10], [11],
[12], [13]. Moreover, in these applications, old versions of the stream need not be preserved. Particularly
for devices such as mobile telephones, where memory size and energy are severely constrained, but also for
any storage system, it is advisable to reduce the space taken by data and also to reduce the energy required
to insert, delete, and modify stored data. In certain applications, in-place reconstruction is desired [12],
necessitating the use of instantaneous source codes.

³Due to Robert G. Gallager.
Recursive estimation and control also require temporarily storing state estimates and updating them at
each time step. Thus such problems also suggest themselves as application areas for malleable coding.
Note that the application of malleable codes would determine how information storage is carried out, not
what information is stored and what information is dissipated [14].

In the scenarios discussed, new versions will be correlated with old versions, not independent as
assumed in previous studies of write-efficient memories [15], [16]. That is, we envision scenarios which
involve updating Archimedes of Siracusa with Archimedes of Syracuse (Levenshtein distance
2) rather than updating it with Jesus of Nazareth (Levenshtein distance 15), though the results will
apply to the entire gamut of scenarios.

There is also another difference between the problems we formulate and prior work on write-efficient 
memories. In write-efficient memories, the encoder can look to see what is already stored in the memory 
before deciding the codeword for the update. An information pattern even more extensive than for
write-efficient memories was discussed in [17]. We require the code to be determined before the encoding
process is carried out. Such an information pattern would arise naturally in remote file synchronization 

ma. 

Once the codeword of the new version is determined (without access to the realized compressed old 
version), there may be settings where the differences between the two must be determined in a distributed 
fashion. For a good malleable code, the old and new codewords will be strongly correlated. Thus, protocols 
for distributed reconciliation of correlated strings may be used [18], [19], [13].

B. Genetic Coding 

With recent advances in biotechnology [20], the storage of artificial messages in DNA strings seems
like a real possibility, rather than just a laboratory pipe dream [21]. Thus the storage of messages in the
DNA of living organisms as a long-lasting, high-density data storage medium provides another motivating
application for malleable coding. Note that although minimum change codes, as we will develop for the
palimpsest problem, have been suggested as an explanation for the genetic code through the optimization
approach to biology [22], here we are concerned with synthetic biology.

As in magnetic or optical storage and perhaps more so, it is desirable to compress information for
storage. For a palimpsest system, one would use site-directed mutagenesis [23, Ch. 7] to perform editing
of stored codewords, whereas for the formulation of malleable coding in [4], molecular biology cloning
techniques using restriction enzymes, oligonucleotide synthesis or polymerase chain reaction (PCR), and
ligation [23, Ch. 3] would be used. In site-directed mutagenesis, when multiple changes cannot be made
using a single primer, the cost of a single insertion, deletion, or substitution is approximately the same and
is additive with respect to the number of edits. Using restriction enzyme methods with oligonucleotide
synthesis, however, the cost is related to the length of the new segment that must be synthesized to replace
the old segment. Thus the biotechnical editing costs correspond exactly to the costs defined in the present
paper and in [4]. Unlike magnetic or optical storage, insertion and deletion are natural operations in DNA
information storage, thereby allowing variable-length codes to be easily edited. Incidentally, insertion and
deletion are also possible in neural information storage through modification of neuronal arbors [24].



IV. Problem Statement

After a few requisite definitions, we will provide a formal statement of the palimpsest problem, which
takes editing costs as well as rate costs into account.

The symbols of the storage medium are drawn from the finite alphabet V. Note that unlike most source
coding problems, the alphabet itself will be used, not just the cardinality of sequences drawn from this
alphabet. Also, it is natural to measure all rates in numbers of symbols from V. This is analogous to
using base-|V| logarithms in place of base-2 logarithms, and all logarithms should be interpreted as such.

We require the notion of an edit distance [25] on V*, the set of all finite sequences of elements of V.

Definition 1: An edit distance, d(·, ·), is a function from V* × V* to [0, ∞), defined by a set of edit
operations. The edit operations are a symmetric relation on V* × V*. The edit distance between a ∈ V*
and b ∈ V* is 0 if a = b and is the minimum number of edit operations needed to transform a into b
otherwise.

An example of an edit distance is the Levenshtein distance, which is constructed from insertion,
deletion, and substitution operations. It can be noted that (V*, d) is a finite metric space (see Appendix A).

Now we can formally define our coding problem. We define the variable-length and block coding
versions together, drawing distinctions only where necessary. Symbols are reused so as to conserve
notation. It should be clear from context whether we are discussing variable-length or block coding.

Let {(X_i, Y_i)}_{i=1}^∞ be a sequence of independent drawings of a pair of random variables (X, Y), X ∈ W,
Y ∈ W, where W is a finite set and p_{XY}(x, y) = Pr[X = x, Y = y]. The marginal distributions are

    p_X(x) = \sum_{y \in W} p_{XY}(x, y)

and

    p_Y(y) = \sum_{x \in W} p_{XY}(x, y).

When the random variable is clear from context, we write p_X(x) as p(x) and so on.
The modification channel

    p_{Y|X}(y|x) = \frac{p(x, y)}{p(x)}

relates the two marginal distributions. If the joint distribution is such that the marginals are equal, the
modification channel is said to perform stationary editing.
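As a concrete instance of the edit distance in Definition 1, the Levenshtein distance can be computed with the standard dynamic program over prefixes. The sketch below is illustrative only; the function name is ours, and the example strings are the ones used in Section III.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to transform sequence a into sequence b."""
    # prev[j] holds the distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("Archimedes of Siracusa", "Archimedes of Syracuse"))
```

The function is symmetric in its arguments and returns 0 exactly when the two sequences are equal, in line with Definition 1.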

Variable-length Codes: A variable-length encoder with block length n is a mapping

    f_E : W^n \to V^*,

and the corresponding decoder with block length n is

    f_D : V^* \to W^n.

The encoder and decoder define a variable-length palimpsest code. The encoder and decoder pair is
required to be instantaneous, in the sense that the encoding may be parsed as a succession of codewords.
A (variable-length) encoder-decoder with block length n is applied as follows. Let

    (A, B) = (f_E(X^n), f_E(Y^n)),

inducing random variables A and B that are drawn from the alphabet V*. Also let

    (\hat{X}^n, \hat{Y}^n) = (f_D(A), f_D(B)).

Block Codes: A block encoder for X with parameters (n, K) is a mapping

    f_E^{(X)} : W^n \to V^{nK},

and a block encoder for Y with parameters (n, L) is a mapping

    f_E^{(Y)} : W^n \to V^{nL}.

Given these encoders, a common decoder with parameter n is

    f_D : V^* \to W^n.

The encoders and decoder define a block palimpsest code. Since there is a common decoder, the two
codes should be in the same format.

A (block) encoder-decoder with parameters (n, K, L) is applied as follows. Let

    (A, B) = (f_E^{(X)}(X^n), f_E^{(Y)}(Y^n)),

inducing random variables A ∈ V^{nK} and B ∈ V^{nL}. The mappings are depicted in Fig. 2. Also let

    (\hat{X}^n, \hat{Y}^n) = (f_D(A), f_D(B)).
Fig. 2. Distributions in representation space induced by distributions in source space.



For both variable-length and block coding, we can define the error rate as

    \Delta = \max(\Delta_X, \Delta_Y),

where

    \Delta_X = \Pr[X^n \neq \hat{X}^n]

and

    \Delta_Y = \Pr[Y^n \neq \hat{Y}^n].

Natural (and completely conventional) performance indices for the code are the per-letter average
lengths of the codewords

    K = \frac{1}{n} E[\ell(A)]

and

    L = \frac{1}{n} E[\ell(B)],

where ℓ(·) denotes the length of a sequence in V*. (In the block coding case, A has a fixed length of
nK letters from the alphabet V, so there is no contradiction in using the previously-defined symbol K.
Similarly for L.)

The final performance measure captures our novel concern with the cost of changing the compressed 
version. The malleability cost is the expected per-source-letter edit distance between the codes: 

    M = \frac{1}{n} E[d(A, B)].
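The three performance indices can be evaluated directly for any finite code. The sketch below is an illustration under stated assumptions: the joint distribution is given as a dictionary over pairs of source blocks, the encoders are lookup tables, the Hamming distance is used as the edit distance, and the names `performance`, `enc_x`, and `enc_y` are ours.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length codewords."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(ca != cb for ca, cb in zip(a, b))


def performance(joint, enc_x, enc_y, n, dist=hamming):
    """Return (K, L, M): expected codeword lengths and expected edit
    distance between codewords, each normalized by the block length n.

    joint maps (x_block, y_block) to a probability; enc_x and enc_y map
    source blocks to codewords over the storage alphabet.
    """
    K = L = M = 0.0
    for (x, y), p in joint.items():
        a, b = enc_x[x], enc_y[y]
        K += p * len(a)
        L += p * len(b)
        M += p * dist(a, b)
    return K / n, L / n, M / n


if __name__ == "__main__":
    # Uncoded binary source whose letter is flipped with probability 0.1
    joint = {("0", "0"): 0.45, ("0", "1"): 0.05,
             ("1", "0"): 0.05, ("1", "1"): 0.45}
    identity = {"0": "0", "1": "1"}
    print(performance(joint, identity, identity, n=1))
```

For variable-length codes a different `dist` (for example a Levenshtein routine) would have to be supplied, since the Hamming distance is only defined for equal lengths.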
Fig. 3. Commutative diagram for the palimpsest problem.



Definition 2: Given a source p(X, Y) and an edit distance d, a triple (K_0, L_0, M_0) is said to be
achievable for the variable-length palimpsest problem if, for arbitrary ε > 0, there exists (for n sufficiently
large) a variable-length palimpsest code with error rate Δ = 0, average codeword lengths K ≤ K_0 + ε,
L ≤ L_0 + ε, and malleability M ≤ M_0 + ε.

Definition 3: Given a source p(X, Y) and an edit distance d, a triple (K_0, L_0, M_0) is said to be
achievable for the block palimpsest problem if, for arbitrary ε > 0, there exists (for n sufficiently large)
a block palimpsest code with error rate Δ ≤ ε, average codeword lengths K ≤ K_0 + ε, L ≤ L_0 + ε, and
malleability M ≤ M_0 + ε.

For the variable-length palimpsest problem, the set of achievable rate-malleability triples is denoted by
P_V; for the block version, the corresponding set is denoted by P_B. It will be our purpose to characterize
P_V and P_B as much as possible.

It follows from the definition that P_V and P_B are closed subsets of R^3 and have the property that if
(K_0, L_0, M_0) ∈ P_V, then (K_0 + ε_1, L_0 + ε_2, M_0 + ε_3) ∈ P_V for any ε_i > 0, i = 1, 2, 3, and
likewise for P_B. Consequently, P_V and P_B are completely defined by their lower boundaries, which too are closed.

Both versions of the palimpsest problem can be viewed using the diagram in Fig. 3. Given p(X, Y)
and thus p(X), p(Y), and p(Y|X), the malleability constraint defines what is achievable in terms of
p(A, B) with the additional constraints that there must be maps between X^n and A, and between Y^n
and B, which allow for lossless or near lossless compression. An alternative formulation as the mapping
between two metric spaces W^n and V* is also possible.

V. Constructive Palimpsest Examples 

Having formulated the palimpsest problem in Section IV, we present some examples of what can
be achieved. These examples revisit Section II. New examples given in Section VI will inspire general
statements.
A. Source Coding with No Compression

The simplest compression scheme is one that simply copies the source sequences to the storage medium.
This is only possible when W = V. When W ≠ V, zero-error coding without compression is possible
with block lengths larger than 1, as in converting hexadecimal digits to binary digits or vice versa. The
flexibility in such a mapping can be exploited. If the shortest possible blocking is used and l is the least
common multiple of |V| and |W|, then there are l! valid mappings. For the moment, we ignore the gains
to be had by exploiting this flexibility and focus on the W = V case, with block length n = 1.
Taking A = X and B = Y, it follows immediately that K = 1 and L = 1.⁴ It also follows that the
malleability cost is M = E[d(X, Y)]. If we take the edit distance to be the Hamming distance, then
M = Pr[X ≠ Y]. Thus the triple (K, L, M) = (1, 1, Pr[X ≠ Y]) is achievable by no compression for
any source distribution p(X, Y) under Hamming edit distance.

B. Ignore Malleability 

Consider what happens when the malleability parameter is ignored and the rates for the variable- 
length encoder are optimized. We will improve rate performance and hopefully not worsen malleability 
too much. 

If the updating process p_{Y|X} is stationary, then a common instantaneous code may be used to
asymptotically achieve K = H(X) and L = H(Y). Picking a single code for different sources has been
well-studied in the source coding literature, starting with [26]. If a single source code is used for a
collection of distributions, the rate loss over the entropy lower bound is termed the redundancy [27]. As
shown by Gilbert, if Huffman or Shannon codes are used, this redundancy is the relative entropy between
the source and the random variable used to design the code.

Restricting to such instantaneous codes, if the palimpsest code is designed for either p(x) or for p(y),
the incurred redundancies are the relative entropies

    D(p_X \| p_Y) = \sum_{x \in W} p_X(x) \log \frac{p_X(x)}{p_Y(x)}

or

    D(p_Y \| p_X) = \sum_{y \in W} p_Y(y) \log \frac{p_Y(y)}{p_X(y)},

respectively. These lead to horizontal and vertical portions of a lower bound for P_V in the K-L plane.

⁴Remember that rates K and L are measured in letters from V, not in bits.

An intermediary portion of this lower bound, between the vertical and horizontal portions, is determined
by finding a random variable Z that is between X and Y and designing a code for it. We want to choose
some "tilted" distribution, p_Z, on the geodesic between the two distributions p_X and p_Y.

If p_Z = ½p_X + ½p_Y, then D(p_X‖p_Z) + D(p_Y‖p_Z) is called the capacitory discrimination [28]. The
rate loss in the balanced rate loss case, D(p_X‖p_Z) when D(p_X‖p_Z) = D(p_Y‖p_Z), has a closed form
expression [29]. The distribution p_Z used to achieve it is halfway (in the asymmetric sense of Y after
X) along the geodesic that connects the two distributions. The distance along the geodesic may be
parameterized by

    t = \frac{D(p_Y \| p_X)}{D(p_Y \| p_X) + D(p_X \| p_Y)}.

The resulting rate loss for Z_t is

    D(p_X \| p_{Z_t}) = R(p_X, p_Y) + \log \mu(t),

where R(p_X, p_Y) is defined through

    \frac{1}{R(p_X, p_Y)} = \frac{1}{D(p_X \| p_Y)} + \frac{1}{D(p_Y \| p_X)},

and

    \mu(t) = \sum_{x \in W} p_X(x)^{1-t} \, p_Y(x)^{t}.

Notice that due to the asymmetry of the relative entropy, this is different from the Chernoff information.
In general, the connecting portion between the horizontal and vertical parts of the lower bound is curved
below the time-sharing line, determined by the relative entropies D(p_X‖p_Z) and D(p_Y‖p_Z) for a Z that
is along the geodesic connecting the two distributions. Fig. 4 shows an example of this achievable lower
bound.
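The balanced point on the geodesic can be checked numerically. The sketch below rests on an assumption that is ours rather than stated in the text: the geodesic is taken to be the exponential-family path p_t(x) ∝ p_X(x)^{1-t} p_Y(x)^t, with μ(t) its normalizing sum. Under that assumption the parameter t given above equalizes the two rate losses, and the rate loss equals R(p_X, p_Y) + log μ(t). Logarithms are base 2 here, which corresponds to a binary storage alphabet; the example distributions are arbitrary.

```python
import math


def kl(p, q):
    """Relative entropy D(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def capacitory_discrimination(p, q):
    """D(p || m) + D(q || m) for the midpoint mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl(p, m) + kl(q, m)


def geodesic_point(p, q, t):
    """Assumed exponential-family path: returns (p_t, mu), where
    p_t(x) = p(x)^(1-t) q(x)^t / mu and mu is the normalizing sum."""
    weights = [pi ** (1 - t) * qi ** t for pi, qi in zip(p, q)]
    mu = sum(weights)
    return [w / mu for w in weights], mu


def balanced_parameter(p_x, p_y):
    """t = D(p_Y || p_X) / (D(p_Y || p_X) + D(p_X || p_Y))."""
    return kl(p_y, p_x) / (kl(p_y, p_x) + kl(p_x, p_y))


def harmonic_rate(p_x, p_y):
    """R defined through 1/R = 1/D(p_X || p_Y) + 1/D(p_Y || p_X)."""
    return 1.0 / (1.0 / kl(p_x, p_y) + 1.0 / kl(p_y, p_x))


if __name__ == "__main__":
    p_x = [0.5, 0.3, 0.2]
    p_y = [0.1, 0.3, 0.6]
    t = balanced_parameter(p_x, p_y)
    p_z, mu = geodesic_point(p_x, p_y, t)
    print("t =", t)
    print("rate losses:", kl(p_x, p_z), kl(p_y, p_z))
    print("R + log mu:", harmonic_rate(p_x, p_y) + math.log2(mu))
    print("capacitory discrimination:", capacitory_discrimination(p_x, p_y))
```

Since μ(t) ≤ 1, the term log μ(t) is non-positive, so the balanced rate loss never exceeds R(p_X, p_Y).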

If the restriction to instantaneous codes is removed, then there are several kinds of universal source
codes that achieve the K ≥ H(X) and L ≥ H(Y) bounds simultaneously [27], [30]; however, instantaneous
codes are required by the palimpsest problem statement. These results say nothing about M; they
only deal with K and L.

To say something about M, one can show that the average starting overlap is rather small [31]. Since
optimal source codes produce equiprobable outputs [32], one might hope that computing M is a matter of
measuring the expected edit distance between two random equiprobable sequences [33], but optimizing
the dependence between these two sequences is actually the problem to be solved.
Fig. 4. K-L region achievable using instantaneous codes for sources related through non-stationary editing. The marked point
is when rate loss for both versions is balanced. The diagonal line segment shows the suboptimal strategy of time-sharing.
(Axis marks: K at H(X) and H(X) + D(X‖Y); L at H(Y) and H(Y) + D(Y‖X).)

C. Source Coding with Incremental Compression 

One might compress the original source using an optimal source code, thereby achieving the K ≥
H(X) lower bound with equality. Then one may produce an optimal source code for the innovation
separately, with rate H(Y|X). Thus the new version would be represented by concatenating the two
pieces, with L = H(X) + H(Y|X) = H(X, Y). Under extended Hamming edit distance, the difference
between the original source code and the new version which has a new piece concatenated would be
M = H(Y|X).

Separate compression of the innovation has the advantage that X^n can be recovered from B; however,
this was not a requirement in the problem formulation and is thus wasteful. Such a coding scheme is
useful in differential encoding for version management systems where all versions should be recoverable.
Results would basically follow from the chain rule of entropy [34] or from successive refinability for a
lossy version of the problem [35].
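The triple reached by incremental compression follows from the chain rule and is easy to evaluate for a given joint distribution. The sketch below is illustrative: entropies are computed in bits (appropriate for a binary storage alphabet), the joint distribution is passed as a dictionary, and the function names are ours.

```python
import math
from collections import defaultdict


def entropy(probs) -> float:
    """Shannon entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def incremental_triple(joint):
    """(K, L, M) = (H(X), H(X, Y), H(Y | X)) for a joint pmf given as a
    dictionary mapping (x, y) to a probability."""
    p_x = defaultdict(float)
    for (x, _), p in joint.items():
        p_x[x] += p  # marginalize out y
    h_x = entropy(p_x.values())
    h_xy = entropy(joint.values())
    return h_x, h_xy, h_xy - h_x  # chain rule: H(Y|X) = H(X,Y) - H(X)


if __name__ == "__main__":
    # Uniform binary X whose value is flipped with probability 0.1
    joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
    print(incremental_triple(joint))
```

For this example H(Y) = 1 bit, so the new rate L = H(X, Y) ≈ 1.47 bits exceeds the entropy bound on L by H(X|Y) ≈ 0.47 bits, which is the price of keeping X^n recoverable from B.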

D. Source Coding with Pulse-Position Modulation 

Another coding strategy is to significantly back off from achieving good rate performance so as to
achieve very good malleability. In particular, we describe a compression scheme that requires only
2 substitution edits for any modification to the source, and so the value of M achieved goes to 0
asymptotically.
We represent any of the possible |W|^n sequences that can occur as X^n or Y^n by a pulse-position
modulation scheme. In particular, we use only two letters from V, which we call 0 and 1 without loss of
generality. The codebook is the set of binary sequences of length |W|^n with Hamming weight 1. Each
possible source sequence is assigned to a distinct codebook entry, thus making Δ = 0. Now modifying
any sequence to any other sequence entails changing a single 0 to a 1 and a single 1 to a 0. Computing
the performance criteria, we get that K = L = (1/n)|W|^n, and so are paying an exponential rate penalty
over simply enumerating the source sequences. The payoff is that M = (2/n) Pr[X^n ≠ Y^n]. This is true
universally, even if X and Y are independent. Note that if a.a.s. no error is desired, then only typical
source sequences need to have codewords assigned to them, and K = L = (1/n) 2^{n max(H(X), H(Y))} (where
H(X) and H(Y) are here in bits) has the same effect on M.
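The pulse-position codebook described above is small enough to build explicitly for toy parameters. The following sketch is illustrative; the function name `ppm_codebook` and the two-letter example alphabet are ours.

```python
from itertools import combinations, product


def ppm_codebook(alphabet, n):
    """Assign every length-n source block a distinct binary codeword of
    length |alphabet|**n and Hamming weight 1."""
    blocks = list(product(alphabet, repeat=n))
    size = len(blocks)
    return {
        block: "0" * i + "1" + "0" * (size - i - 1)
        for i, block in enumerate(blocks)
    }


def hamming(a: str, b: str) -> int:
    """Hamming distance between equal-length codewords."""
    return sum(ca != cb for ca, cb in zip(a, b))


if __name__ == "__main__":
    n = 3
    book = ppm_codebook("ab", n)
    length = len(next(iter(book.values())))
    print("codeword length:", length, "rate K = L =", length / n)
    distances = {hamming(u, v) for u, v in combinations(book.values(), 2)}
    print("pairwise Hamming distances:", distances)
```

Every pair of distinct codewords is at Hamming distance exactly 2, independent of which source blocks they represent, while the codeword length grows as |W|^n.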

Pulse-position modulation is also a possible scheme for achieving channel capacity per unit cost [36],
where an exponential spectral efficiency penalty is paid in order to have very low power. 

VI. Source Coding with Graph Embedding 

Before constructing an example, let us develop some lower bounds for arbitrary sources p(X, Y). From
the source coding theorems, it follows that K ≥ H(X) and L ≥ H(Y). We observe that since distinct
codewords must have an edit distance of at least one, we can lower bound M by assuming that distance
1 is achieved for all codewords. Then the edit distance is simply the probability of error for uncoded
transmission of p(x) through the channel p(y|x), since each error gives edit distance 1 and each correct
reception gives edit distance 0. Thus for n = 1,

    M \geq \sum_{x \in W} \sum_{y \in W : y \neq x} p(x, y).

For larger n the bound is
similarly derived to be

A weaker, simplified version of the bound is M > ^. As will be evident in the sequel, this weaker bound 
is related to Lipschitz constants for the mapping from the source space to the representation space. 
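The counting argument above can be evaluated numerically. The n = 1 function below implements the double sum from the text. The block version is only our reading of "similarly derived" and should be treated as an assumption: if distinct codewords are at distance at least 1, then E[d(A, B)] ≥ Pr[X^n ≠ Y^n], and for memoryless pairs Pr[X^n = Y^n] = Pr[X = Y]^n. Function names are illustrative.

```python
def symbol_error_probability(joint) -> float:
    """n = 1 bound: sum of p(x, y) over all pairs with y != x."""
    return sum(p for (x, y), p in joint.items() if x != y)


def block_bound(joint, n: int) -> float:
    """Assumed block-length-n analogue: (1/n) * Pr[X^n != Y^n], using
    Pr[X^n = Y^n] = Pr[X = Y]**n for independent, identically
    distributed pairs."""
    p_same = 1.0 - symbol_error_probability(joint)
    return (1.0 - p_same ** n) / n


if __name__ == "__main__":
    # Toy source: each letter is changed with probability 1/2
    joint = {(0, 0): 0.25, (0, 1): 0.25, (1, 1): 0.25, (1, 0): 0.25}
    for n in (1, 2, 4, 8):
        print(n, block_bound(joint, n))
```

Under this reading the bound decays roughly like 1/n once Pr[X^n ≠ Y^n] is close to 1, which is why longer blocks leave room to reduce M.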

A. Graph Embedding using Gray Codes 

Now we construct an example that simultaneously achieves the rate lower bounds and the malleability
lower bound (1). Consider a memoryless source p(x) with an alphabet W of eight letters,
such that each letter is drawn equiprobably.⁵ Then the original version of the source has entropy
3 bits. Consider the relationship between X and Y given by a noisy typewriter channel, with channel
transmission matrix

p(y|x)

⁵The scholar of linguistics and coding theory will notice the relevance of the order in which the alphabet is written [37].
Evidently, the bound on M is 1/2 for n = 1, found by performing the summation in (1). Moreover, the
marginal distribution of Y is also equiprobable over the alphabet W, which gives the entropy bound on
L to be 3 bits.

Take V to be {0,1}. Now we develop a binary encoding scheme that has performance coincident 
with the established inner bounds, using graph embedding methods. We can draw a graph where the 
vertices are the symbols and the edges are labeled with the associated probabilities of transition; the 
weighted directed edges are combined into weighted undirected edges in some suitable way. The result 
is a weighted adjacency graph, a weighted version of the adjacency graphs in [6], [7], shown in Fig. 5.

Suppose that the edit distance is the Hamming distance. Now we try to embed this adjacency graph 
into a hypercube of a given size. Since we want the average code length to be small, we first consider 
the hypercube of size 3. The adjacency graph is exactly embeddable into the hypercube, as shown in 
Fig. 6. If it were not exactly embeddable, some of the low weight edges might have to be broken. As
an alternative to the edge weights being determined from the transition matrix, the edge weights may be
determined through a joint typicality measure (as in the message graph in [38] and in Section VII). After
we complete the embedding into the hypercube, we use the binary reflected Gray code (see e.g. [39]
for a description) to assign codewords through correspondence. The binary reflected Gray code-labeled
hypercube is shown in Fig. 7.

Clearly the error rate for this scheme is Δ = 0, since the code is lossless. Since all codewords are
of length 3, K = L = 3. To compute M, notice that any source symbol is perturbed to one of
its neighbors with probability 1/2. Further notice that the Hamming distance between neighbors in the
hypercube is 1. Thus M = 1/2. We have seen that this encoding scheme achieves the entropy bounds
Fig. 5. Weighted adjacency graph for noisy typewriter channel (2).




Fig. 6. Weighted adjacency graph for noisy typewriter channel embedded in 3-dimensional hypercube. Thick lines represent 
edges that are used in the embedding. Dotted lines represent edges in the hypercube that are unused in the embedding. 












Fig. 7. Hypercube graph labeled with binary reflected Gray code. 
H(X) and H(Y). It also achieves the n = 1 lower bound for M and is thus optimal for n = 1. We can
further drive M down by increasing the block length. As shown in the following proposition, if a graph
is embeddable in another graph and we take Cartesian products of each with itself, then the resulting
graphs obey the same embedding relationship.
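
The n = 1 scheme can be checked numerically. The sketch below is only illustrative: it assumes the noisy typewriter channel keeps each of the eight equiprobable symbols with probability 1/2 and moves it to the next symbol on a cycle with probability 1/2, so that the adjacency graph is an 8-cycle. That channel model and the function names (`gray_code`, `hamming`, `malleability`) are assumptions made for the example, not notation from the text.

```python
def gray_code(bits):
    """Binary reflected Gray code: the i-th codeword is i XOR (i >> 1)."""
    return [i ^ (i >> 1) for i in range(1 << bits)]


def hamming(a, b):
    """Hamming distance between two integers viewed as bit strings."""
    return bin(a ^ b).count("1")


def malleability(code, joint):
    """Expected Hamming distance between the codewords of X and Y."""
    return sum(p * hamming(code[x], code[y]) for (x, y), p in joint.items())


# Assumed model: 8 equiprobable symbols; each symbol either stays put or
# moves to the next symbol on a cycle, each with probability 1/2.
n_symbols = 8
joint = {}
for x in range(n_symbols):
    joint[(x, x)] = (1 / 8) * (1 / 2)
    joint[(x, (x + 1) % n_symbols)] = (1 / 8) * (1 / 2)

code = gray_code(3)
M = malleability(code, joint)
print([format(c, "03b") for c in code], M)
```

Because cyclically consecutive Gray codewords differ in exactly one bit, every edit of the assumed channel costs one bit flip, which is why the expected cost equals the probability that the symbol changes.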

Definition 4: Consider two graphs $G$ and $H$ with vertices $V(G)$ and $V(H)$ and edges $E(G)$ and
$E(H)$, respectively. Then $G$ is said to be embeddable into $H$ if $H$ has a subgraph isomorphic to $G$. That
is, there is an injective map $\phi : V(G) \to V(H)$ such that $(u,v) \in E(G)$ implies $(\phi(u), \phi(v)) \in E(H)$.
This is denoted as $G \rightsquigarrow H$.

Definition 5: Consider two graphs $G_1$ and $G_2$ with vertices $V(G_1)$ and $V(G_2)$ and edges $E(G_1)$ and
$E(G_2)$, respectively. Then the Cartesian product of $G_1$ and $G_2$, denoted $G_1 \times G_2$, is a graph with vertex
set $V(G_1) \times V(G_2)$, and for vertices $u = (u_1,u_2)$ and $v = (v_1,v_2)$, $(u,v) \in E(G_1 \times G_2)$ when ($u_1 = v_1$
and $(u_2,v_2) \in E(G_2)$) or ($u_2 = v_2$ and $(u_1,v_1) \in E(G_1)$).

Proposition 1: If $G_1 \rightsquigarrow H_1$ and $G_2 \rightsquigarrow H_2$, then $G_1 \times G_2 \rightsquigarrow H_1 \times H_2$. A special case is that $G \rightsquigarrow H$
implies $G \times G \rightsquigarrow H \times H$.

Proof: See Appendix B. ■

Corollary 1: Let $G^n$ denote the n-fold Cartesian product of $G$ and $H^n$ the n-fold Cartesian product
of $H$. If $G \rightsquigarrow H$, then $G^n \rightsquigarrow H^n$ for $n = 1, 2, \ldots$.

Proof: By induction. ■
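
The product construction can be exercised on a small case. The following sketch builds Cartesian products of undirected graphs and checks that the coordinate-wise map is an embedding of the products. The graph representation (a vertex set plus a set of `frozenset` edges), the helper names, and the choice of a 3-vertex path embedded in a 4-cycle are all assumptions made for illustration.

```python
from itertools import product


def graph(vertices, edge_list):
    """Undirected graph as (vertex set, set of frozenset edges)."""
    return set(vertices), {frozenset(e) for e in edge_list}


def cartesian_product(g1, g2):
    """Cartesian product: change one coordinate along an edge, keep the other."""
    (v1, e1), (v2, e2) = g1, g2
    vertices = set(product(v1, v2))
    edges = set()
    for (a1, a2), (b1, b2) in product(vertices, repeat=2):
        same_first = a1 == b1 and frozenset((a2, b2)) in e2
        same_second = a2 == b2 and frozenset((a1, b1)) in e1
        if same_first or same_second:
            edges.add(frozenset(((a1, a2), (b1, b2))))
    return vertices, edges


def is_embedding(phi, g, h):
    """True if phi is injective into V(h) and maps edges of g to edges of h."""
    (vg, eg), (vh, eh) = g, h
    images = {phi[v] for v in vg}
    if len(images) != len(vg) or not images <= vh:
        return False
    return all(frozenset(phi[u] for u in edge) in eh for edge in eg)


G = graph(range(3), [(0, 1), (1, 2)])                  # path on 3 vertices
H = graph(range(4), [(0, 1), (1, 2), (2, 3), (3, 0)])  # 4-cycle, i.e. the 2-cube
phi = {0: 0, 1: 1, 2: 2}

GG = cartesian_product(G, G)
HH = cartesian_product(H, H)
phi2 = {(a, b): (phi[a], phi[b]) for (a, b) in GG[0]}  # coordinate-wise map
print(is_embedding(phi, G, H), is_embedding(phi2, GG, HH))
```

The coordinate-wise map is the natural candidate for the product embedding; iterating the same step gives the n-fold case.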

Returning to our example, since the embedding relation is true for n = 1, it is also true for n = 2, 3, ...,
so we can embed n-fold Cartesian products of the adjacency graph into n-fold Cartesian products of the
hypercube. Such a scheme would achieve rates of K = 3 bits and L = 3 bits. It would also achieve M
of $\frac{1}{n}\Pr[X_1^n \neq Y_1^n]$ since the Cartesian product of the adjacency graph represents exactly edit costs of 1.
For each n, this matches the lower bounds given in (1), and is thus optimal. Furthermore, since this M is
at most 1/n, asymptotically in n the triple (K, L, M) = (3, 3, 0) is achievable.

One may observe that embeddability into a graph where graph distance corresponds to edit distance 
seems to be sufficient to guarantee good performance; we will explore this in detail in the sequel. But first, 
we present a similar but more challenging situation as a contrast to the "best of all worlds" performance 
we have just seen. 

With the source alphabet, representation alphabet, and distribution of X remaining the same, let us
suppose that the relationship between X and Y is given by

(3)



One can verify that, like (2), this is a stationary editing process. Thus, the rate bounds are unchanged
at K ≥ 3 and L ≥ 3. Also, evaluation of (1) yields a lower bound on M for block size n = 1. We will
presently see that the three lower bounds cannot be achieved simultaneously, and we will determine the
best values of (K, L, M) for n = 1.

The weighted adjacency graph corresponding to the new editing process is depicted in Fig. 8. Continuing
to use the Hamming edit distance, to achieve K = 3, L = 3, and the M lower bound simultaneously
would require the embeddability of the graph of Fig. 8 into the hypercube of size 3. Such an embedding is
clearly not possible since two nodes of the adjacency graph have degree 4, whereas the maximum degree
of the hypercube is 3.

To achieve the least increase in M above the lower bound (1), we must advantageously choose edges
in the adjacency graph to break to create embeddability. (As we will see later, choosing the optimal set
of edges to break involves solving the error-tolerant subgraph isomorphism problem.) In this example,
the two nodes of degree 4 must each have at least one edge broken. Picking the lowest weight edges
(the two with weight 1/10) is clearly the best choice, as the resulting graph can be embedded in the
hypercube and cost of the edits ^ and W is increased by the least possible amount (from 1 
to 2). Each of the broken edges has probability ^ • |, so M is increased above the previously computed 
minimum by Thus we achieve {K,L,M) = (3,3, ||). 

We may alternatively aim for lower M at the expense of K and L. To determine whether the lower
bound (1) can be achieved with K = L = 4, we need to check if the weighted adjacency graph of Fig. 8
can be embedded in the hypercube of size 4. Fig. 9 shows that this embedding is possible, with the code
given in Table I. Thus one can achieve K = L = 4 with M equal to the n = 1 lower bound (1).
Fig. 8. Weighted adjacency graph for stationary editing process (3).

Fig. 9. Weighted adjacency graph for editing process (3) embedded in 4-dimensional hypercube. Black lines represent edges
that are used in the embedding. Gray lines represent edges in the hypercube that are unused in the embedding.

B. Extension to Non-equiprobable Sources 

The fact that both versions in the previous example were equiprobable and thus incompressible might
cast doubt on its significance. Here we consider another example where the sources are not equiprobable. We
will make use of variable-length lossless source codes and the Levenshtein distance as the edit distance.
The basic edit operations are substitution, insertion, and deletion, as opposed to the Hamming distance
where substitution is the only edit operation. Similar to the hypercube graph for the Hamming distance,
we can create a Levenshtein edit distance graph. The Levenshtein graph of binary strings up to length 3
is shown in Fig. 10.

TABLE I

Code for the 4-dimensional hypercube embedding shown in Fig. 9.

Codeword
0000
0100
1000
0010
0011
1001
0101
0001

Fig. 10. Levenshtein Edit Distance Graph for {0, 1} ∪ {0, 1}^2 ∪ {0, 1}^3.

TABLE II

Huffman Code for 4-ary source.

x | p(x) | f_Huffman(x) | ℓ_Huffman(x) | p(x)ℓ(x)
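
A Levenshtein graph of this kind can be generated directly from the distance. The sketch below is a minimal illustration, assuming the vertex set is the nonempty binary strings of length at most 3 (as in the caption of Fig. 10) and joining two strings whenever their Levenshtein distance is 1. The function names are hypothetical.

```python
from itertools import product


def levenshtein(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def levenshtein_graph(max_len):
    """Vertices: binary strings of length 1..max_len; edges: distance exactly 1."""
    vertices = ["".join(bits)
                for n in range(1, max_len + 1)
                for bits in product("01", repeat=n)]
    edges = {(u, v) for u in vertices for v in vertices
             if u < v and levenshtein(u, v) == 1}
    return vertices, edges


vertices, edges = levenshtein_graph(3)
print(len(vertices), len(edges))
```

Unlike the hypercube, this graph joins strings of different lengths (for example "0" and "00"), which is what allows variable-length codewords to be neighbors.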

Consider a memoryless source with a four-letter alphabet W, with probabilities shown in
Table II. Also in Table II we find a Huffman code for the source, which is the best variable-length
lossless source code [2]. Since the marginal distribution p(x) is dyadic, it is at the center of a code
attraction region of the binary Huffman code and achieves the entropy lower bound exactly [30]:

$$K = \sum_{x \in \mathcal{W}} p(x)\,\ell(x) = 1.75 = H(X) = -\sum_{x \in \mathcal{W}} p(x) \log_2 p(x).$$
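
The equality between expected Huffman length and entropy for a dyadic source is easy to confirm numerically. In the sketch below the probabilities (1/2, 1/4, 1/8, 1/8) are an assumption: they are a dyadic four-letter distribution with entropy 1.75 bits, chosen to match the value quoted above, and are not taken from Table II. The helper `huffman_lengths` is likewise a hypothetical name.

```python
import heapq
from math import log2


def huffman_lengths(probs):
    """Codeword length of each symbol in a binary Huffman code."""
    # Heap items: (probability, tie-breaker, symbol indices in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # every symbol under the merge gains one bit
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths


probs = [1 / 2, 1 / 4, 1 / 8, 1 / 8]   # assumed dyadic distribution
lengths = huffman_lengths(probs)
K = sum(p * l for p, l in zip(probs, lengths))
H = -sum(p * log2(p) for p in probs)
print(lengths, K, H)
```

For a dyadic distribution each Huffman length equals $-\log_2 p(x)$, so the expected length and the entropy agree term by term.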
Fig. 11. Adjacency graph for noisy typewriter-like channel.




Fig. 12. Adjacency graph for noisy typewriter-like channel embedded in the Levenshtein graph. 



Now consider a channel that is like the noisy typewriter channel, with channel transmission matrix 





piy\x) 



3 1 

4 2 



-s 




i 

1 1 

2 4 

1 3 

4 4 



(4) 



Evidently the editing is stationary, so the same Huffman code is optimal for both X and Y. Constructing
the adjacency graph yields Fig. 11. This graph can be embedded (with matched vertex labels) in the
Levenshtein graph using the Huffman assignment that we had developed, as shown in Fig. 12.
Evaluating the malleability lower bound (1) for n = 1 in this case gives

$$M \geq \sum_{x \in \mathcal{W}} \sum_{y \neq x} p(x,y).$$

With the code that we have used, we can achieve K = L = 7/4 with M equal to this lower bound, which meets the
n = 1 lower bounds tightly, so it is optimal in the compression and malleability senses. As before,
we can consider Cartesian products to reduce M; however, things are a bit more complicated since the
Levenshtein graph does not grow as a Cartesian product.
C. Minimal Change Codes

As seen in the previous subsection, Gray codes and related minimal change code constructions seem 
to play a role in achieving good palimpsest performance. We review minimal change codes and some of 
their previous uses in communication theory, pointing out connections to our problem. We use minimal 
change codes to expand our treatment in the previous parts from using just Hamming or Levenshtein 
distances to include general edit distances. 

Definition 6: Let $G$ be a connected graph. The path metric $d_G$ associated with the graph $G$ is the
integer-valued metric on the vertices of $G$ which is defined by setting $d_G(u, v)$ equal to the length of the
shortest path in $G$ joining $u$ and $v$.

Proposition 2: For any edit distance $d : \mathcal{V}^* \times \mathcal{V}^* \to [0, \infty)$, there exists a graph $G$ with vertex set $\mathcal{V}^*$
such that its path metric $d_G = d$.

Proof: Construct a graph on vertex set $\mathcal{V}^*$ by adding an edge for any pair of vertices $A, B \in \mathcal{V}^*$
such that $d(A, B) = 1$. ■

Definition 7: An ordered codebook $(A_i)$, $i = 1, 2, \ldots$, $A_i \in \mathcal{V}^*$, is a minimal change code with
respect to edit distance $d$ if it is a Hamiltonian path in a subgraph of the graph on $\mathcal{V}^*$ associated with $d$.
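
Both the construction in the proof of Proposition 2 and the minimal change property can be checked on a finite example. The sketch below is illustrative only: it restricts the vertex set to nonempty binary strings of length at most 3 (a finite truncation chosen for tractability; truncating the vertex set could in general lengthen shortest paths), builds the distance-1 graph for the Levenshtein distance, and compares breadth-first-search distances with the edit distance. All function names are assumptions.

```python
from collections import deque
from itertools import product


def levenshtein(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def path_metric(vertices, dist):
    """Shortest-path lengths in the graph joining vertices with dist == 1."""
    nbrs = {u: [v for v in vertices if dist(u, v) == 1] for u in vertices}
    metric = {}
    for s in vertices:
        seen = {s: 0}
        queue = deque([s])
        while queue:                      # breadth-first search from s
            u = queue.popleft()
            for v in nbrs[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for t, d in seen.items():
            metric[(s, t)] = d
    return metric


def is_minimal_change_code(codebook, dist):
    """Distinct codewords whose consecutive entries are at distance 1."""
    return (len(set(codebook)) == len(codebook)
            and all(dist(a, b) == 1 for a, b in zip(codebook, codebook[1:])))


vertices = ["".join(b) for n in (1, 2, 3) for b in product("01", repeat=n)]
metric = path_metric(vertices, levenshtein)
agrees = all(metric[(u, v)] == levenshtein(u, v)
             for u in vertices for v in vertices)

gray = ["000", "001", "011", "010", "110", "111", "101", "100"]
natural = [format(i, "03b") for i in range(8)]
print(agrees, is_minimal_change_code(gray, levenshtein),
      is_minimal_change_code(natural, levenshtein))
```

The Gray code ordering passes the check while the natural binary ordering does not, since consecutive integers such as 1 and 2 have codewords more than one edit apart.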

Our definition of minimal change codes is a generalization of Gray codes, which are Hamiltonian
paths through the hypercube associated with Hamming distance [41]. Other minimal change codes may
include Hamiltonian paths through the Levenshtein graph (Fig. 10), the de Bruijn graph, or the graph
induced by Dobrushin's distance functions for insertion/deletion channels [42]. There are countless other
edit distances with numerous minimal change codes corresponding to each.

Minimal change codes have been used previously in the architecture design of parallel computers
and in switching theory, among other places. Of particular interest to us, however, is their use in joint
source-channel coding (JSCC) [43]. There are related problems in signal constellation labeling [39], [44],
[45], in the genotype to phenotype mapping problem mentioned previously [22], [9], or in the problem of
labeling books for ease of browsing [46]. There are also several theories of cognition based on preserving
similarity relations from a source space in a representation space, though minimal change codes do not
seem to be used explicitly [47].

Consider JSCC with source alphabet $\mathcal{X}$, channel input alphabet $\mathcal{A}$, channel output alphabet $\mathcal{B}$, and
source reconstruction alphabet $\mathcal{Y}$. Then the injective mapping between $\mathcal{X}$ and $\mathcal{A}$ is the index assignment
for JSCC. The mapping between $\mathcal{A}$ and $\mathcal{B}$ is given by the noisy channel, a transition probability
assignment, $p(b|a)$. The surjective mapping between $\mathcal{B}$ and $\mathcal{Y}$ is the inverse index assignment operation.
The goal in selecting index assignments is to minimize the distortion between the $\mathcal{X}$ and $\mathcal{Y}$ spaces when
there are errors between the $\mathcal{A}$ and $\mathcal{B}$ spaces. Informally using terminology from genetics, the source spaces
$\mathcal{X}$ and $\mathcal{Y}$ are cast as phenotype, whereas the index spaces $\mathcal{A}$ and $\mathcal{B}$ are cast as genotype. Then index
assignment aims to have small mutations in genotype result in small changes in phenotype.

In the palimpsest problem,* the injective and surjective mappings between $\mathcal{X}$ and $\mathcal{A}$ as well as $\mathcal{B}$ and
$\mathcal{Y}$ are basically the same as in the joint source-channel coding problem. The distinction between the
two problems is that for malleable coding, there is a transition probability assignment between $\mathcal{X}$ and $\mathcal{Y}$,
rather than between $\mathcal{A}$ and $\mathcal{B}$. One goal is to minimize the distance between words in the $\mathcal{A}$ and $\mathcal{B}$ spaces
for perturbations in the $\mathcal{X}$ and $\mathcal{Y}$ spaces. Using the genetics analogy, index assignments so that small
changes in phenotype result in small changes in genotype are desired. One might even call malleable
coding a problem in joint channel source coding.

Considering that index assignment for JSCC, signal constellation labeling, and the palimpsest problem
are so similar, it is not surprising that Gray codes come up in all of them [43], [39]. All are essentially
problems of embedding: performing a transformation on objects of one type to produce objects of a
new type such that the distance between the transformed objects approximates the distance between the
original objects [25].

VII. General Characterizations 

We have seen that there may be a trade-off between the various parameters (K, L, M) and have found
several easily achieved points. Our interest now turns to obtaining more detailed characterizations of
the sets of achievable rate-malleability triples.

A. Variable-length Coding 

We begin with the characterization for variable-length coding, which is a problem in zero-error information
theory [48], [49]. Our results are expressed in terms of the solution to an error-tolerant attributed subgraph
isomorphism problem Q, which we first describe in general.

1) Error-Tolerant Attributed Subgraph Isomorphism Problem: A vertex-attributed graph is a three-tuple
$G = (V, E, \mu)$, where $V$ is the set of vertices, $E \subseteq V \times V$ is the set of edges, and $\mu : V \to \mathcal{V}^*$ is
a function assigning labels to vertices. The set of labels is denoted $\mathcal{V}^*$. The definition of embedding for
attributed graphs has a slightly stronger requirement than for unattributed graphs, Def. 4.

Definition 8: Consider two vertex-attributed graphs $G = (V(G), E(G), \mu_G)$ and $H = (V(H), E(H), \mu_H)$.
Then $G$ is said to be embeddable into $H$ if $H$ has a subgraph isomorphic to $G$. That is, there is an injective
map $\phi : V(G) \to V(H)$ such that $\mu_G(v) = \mu_H(\phi(v))$ for all $v \in V(G)$ and that $(u,v) \in E(G)$ implies
$(\phi(u), \phi(v)) \in E(H)$. This is denoted as $G \rightsquigarrow H$.

*To make the correspondence more precise, let $\mathcal{X} = \mathcal{Y} = \mathcal{W}$ and $\mathcal{A} = \mathcal{B} = \mathcal{V}$.

Several graph editing operations may be defined, such as substituting a vertex label, deleting a vertex,
deleting an edge, and inserting an edge. These four operations are powerful enough to transform any
attributed graph into a subgraph of any other attributed graph. The edited graph is denoted through the
operator $\mathcal{E}(\cdot)$ corresponding to the sequence of graph edit operations $\mathcal{E} = (e_1, \ldots, e_k)$. There is a cost
associated with each sequence of graph edit operations, $C(\mathcal{E})$.

Definition 9: Given two graphs $G$ and $H$, an error-correcting attributed subgraph isomorphism $\psi$ from
$G$ to $H$ is the composition of two operations $\psi = (\mathcal{E}, \phi_{\mathcal{E}})$ where

• $\mathcal{E}$ is a sequence of graph edit operations such that there exists a $\mathcal{E}(G)$ that satisfies $\mathcal{E}(G) \rightsquigarrow H$.

• $\phi_{\mathcal{E}}$ is an embedding of $\mathcal{E}(G)$ into $H$.

Definition 10: The subgraph distance $\rho(G, H)$ is the cost of the minimum cost error-correcting
attributed subgraph isomorphism $\psi$ from $G$ to $H$.
Note that in general, $\rho(G, H) \neq \rho(H, G)$.

Remark 1: It should be noted that the subgraph isomorphism problem is NP-complete [50], and
therefore the error-tolerant subgraph isomorphism problem, which contains it as a special case, is NP-hard
and is generally harder than the exact subgraph isomorphism problem Q.

2) Closeness Vitality: The subgraph isomorphism cost structure for the palimpsest problem is based on
a graph theoretic quantity called the closeness vitality [51]. Vitality measures determine the importance
of particular edges and vertices in a graph.

Definition 11: Let $\mathcal{G}$ be the set of all graphs $G = (V,E)$, and let $f : \mathcal{G} \to \mathbb{R}$ be any real-valued
function on $\mathcal{G}$. A vitality index $v(G, x)$ is the difference of the values of $f$ on $G$ and on $G$ without
element $x$; it satisfies $v(G, x) = f(G) - f(G - x)$.

A particular vitality index is the closeness vitality, defined in terms of the Wiener index [52], which is
simply the sum of all pairwise distances.

Definition 12: The Wiener index $f_W(G)$ of a graph $G$ is the sum of the distances of all vertex pairs:

$$f_W(G) = \sum_{u \in V} \sum_{v \in V} d_G(u, v).$$

Definition 13: The closeness vitality $cv(G, x)$ is the vitality index with respect to the Wiener index:

$$cv(G, x) = f_W(G) - f_W(G - x).$$

In addition to the application in the palimpsest problem, the closeness vitality also determines traffic- 
related costs in all-to-all routing networks. 
Finding the distance matrix to compute the Wiener index involves solving the all-pairs shortest path
problem. Finding the distance matrix of a modified graph from the distance matrix of the original graph
involves solving the dynamic all-pairs shortest path problem [53], [54].
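
As a small illustration of these quantities, the sketch below computes the Wiener index and the closeness vitality of an edge by running a breadth-first search from every vertex. It assumes unit-length edges and sums over ordered vertex pairs; both are conventions adopted for the example, and the function names are hypothetical. It recomputes all distances after the removal rather than using a dynamic all-pairs algorithm.

```python
from collections import deque


def distances_from(adj, s):
    """Breadth-first-search distances from s in an unweighted graph."""
    seen = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen[v] = seen[u] + 1
                queue.append(v)
    return seen


def wiener_index(adj):
    """Sum of distances over ordered vertex pairs (infinite if disconnected)."""
    total = 0
    for s in adj:
        d = distances_from(adj, s)
        if len(d) != len(adj):
            return float("inf")
        total += sum(d.values())
    return total


def remove_edge(adj, u, v):
    """Copy of the adjacency lists without the undirected edge {u, v}."""
    return {w: [x for x in nb if {w, x} != {u, v}] for w, nb in adj.items()}


def closeness_vitality(adj, u, v):
    """cv(G, e) = f_W(G) - f_W(G - e) for the edge e = {u, v}."""
    return wiener_index(adj) - wiener_index(remove_edge(adj, u, v))


cycle8 = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
print(wiener_index(cycle8), closeness_vitality(cycle8, 0, 1))
```

Removing an edge of the cycle turns it into a path, so pairwise distances can only grow and the vitality is negative.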

3) Characterization: For our purposes, we are concerned with the error-tolerant embedding of
an attributed, weighted source adjacency graph into the graph induced by an edit distance on $\mathcal{V}^*$. As
such, edge deletion will be the only graph editing operation that is required. Error-tolerant embedding
problems in pattern recognition and machine vision often have simple cost functions lUl, [55]; our cost
function is determined by the closeness vitality and is not so simple.

To characterize the variable-length region, let us first consider the delay-free case, n = 1. A source p(X,Y) and an
edit distance d(·,·) are given. It is known [2] that Huffman coding provides the minimal redundancy
instantaneous code and achieves expected performance H(X) ≤ K < H(X) + 1. Similarly, a Huffman
code for Y would yield H(Y) ≤ L < H(Y) + 1. The rate loss for using an incorrect Huffman code is
essentially as given in Fig. 4. Suppose that we require that the rate lower bound is met, i.e. we must use
a Huffman code for some Z that is on the geodesic between X and Y. This code will satisfy the Kraft
inequality [56]. Note that for a given Z, there are several Huffman codes: those arising from different
labelings of the code tree and also perhaps different trees [57]. Let us denote the set of all Huffman
codes for Z as $\mathcal{H}_Z$.

Since K and L are fixed by the choice of Z, all that remains is to determine the set of achievable
M. Let G be the graph induced by the edit distance d(·,·), and $d_G$ its path metric. The graph G is
intrinsically labeled. Let A be the weighted adjacency graph of the source p(X,Y), with vertices $\mathcal{W}$,
edges $E(A) \subseteq \mathcal{W} \times \mathcal{W}$, and labels given by a Huffman code. That is, $A = (\mathcal{W}, E(A), f_E)$ for
some $f_E \in \mathcal{H}_Z$. There is a path semimetric, $d_A$, associated with the graph A (since the adjacency graph
is weighted, it might not satisfy the triangle inequality).

As may be surmised from Section VI, the basic problem is to solve the error-tolerant subgraph
isomorphism problem of embedding A into G. In general for n = 1, the malleability cost under edit
distance $d_G$ when using the source code $f_E$ is

$$M = \sum_{x \in \mathcal{W}} \sum_{y \in \mathcal{W}} p(x,y)\, d_G(f_E(x), f_E(y)).$$

The smallest malleability possible is when $A = (\mathcal{W}, E(A), f_E)$ is a subgraph of G, and then

$$M_{\min} = \sum_{x \in \mathcal{W}} \sum_{y \in \mathcal{W}} p(x,y)\, d_A(x,y) = \sum_{x \in \mathcal{W}} \sum_{y \in \mathcal{W}} p(x,y)\, d_G(f_E(x), f_E(y)) = \Pr[X \neq Y],$$

which is simply the expected Wiener index

$$M_{\min} = E[f_W(A)] = \Pr[X \neq Y].$$

If edges in A need to be broken in order to make it a subgraph of G, then M increases as a result. The
cost of graph editing operations in the error-tolerant embedding problem should reflect the effect on M.
If an edge e is removed from the graph A, the resulting graph is called A − e; it induces its own path
semimetric $d_{A-e}$. Thus the cost of removing an edge, e, from the graph A is given by the following
expression as a function of the associated removal operation e:

$$C(e) = \sum_{x \in \mathcal{W}} \sum_{y \in \mathcal{W}} p(x,y) \left[ d_{A-e}(f_E(x), f_E(y)) - d_A(f_E(x), f_E(y)) \right],$$

which is the negative expected closeness vitality

$$C(e) = -E[cv(A, e)].$$

If $\mathcal{E}$ is a sequence of edge removals, then

$$C(\mathcal{E}) = \sum_{x \in \mathcal{W}} \sum_{y \in \mathcal{W}} p(x,y) \left[ d_{A-\mathcal{E}}(f_E(x), f_E(y)) - d_A(f_E(x), f_E(y)) \right],$$

which is

$$C(\mathcal{E}) = -E[cv(A, \mathcal{E})].$$

As seen, the cost function is quite different from standard error-tolerant embedding problems Q, [55]
since it depends not only on which edge is broken, but also on the remainder of the graph.
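
The dependence of the removal cost on the rest of the graph can be seen in a toy computation. The sketch below evaluates the probability-weighted increase in path distance caused by deleting a set of edges. It is a simplified illustration under stated assumptions: vertices are used directly as their own labels, edges have unit length, the graph stays connected after the removal, and the joint distribution on a 4-cycle is invented for the example.

```python
from collections import deque
from fractions import Fraction


def all_pairs(adj):
    """Unweighted shortest-path distances between all ordered vertex pairs."""
    dist = {}
    for s in adj:
        seen = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for t, d in seen.items():
            dist[(s, t)] = d
    return dist


def removal_cost(adj, joint, removed_edges):
    """Sum over (x, y) of p(x, y) times the increase in distance from x to y."""
    removed = {frozenset(e) for e in removed_edges}
    pruned = {u: [v for v in nb if frozenset((u, v)) not in removed]
              for u, nb in adj.items()}
    before, after = all_pairs(adj), all_pairs(pruned)
    return sum(p * (after[pair] - before[pair]) for pair, p in joint.items())


# Hypothetical source on a 4-cycle: stay with probability 1/2, otherwise
# move to either neighbor with probability 1/4; X is uniform on 4 symbols.
cycle4 = {i: [(i - 1) % 4, (i + 1) % 4] for i in range(4)}
joint = {}
for x in range(4):
    joint[(x, x)] = Fraction(1, 8)
    joint[(x, (x + 1) % 4)] = Fraction(1, 16)
    joint[(x, (x - 1) % 4)] = Fraction(1, 16)

cost = removal_cost(cycle4, joint, [(0, 3)])
print(cost)
```

Here the broken edge must be replaced by the three-edge detour around the cycle, so the cost is larger than the probability mass on that edge alone.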
Putting things together, we see that the variable-length achievable region contains any point

$$(K, L, M) = \left( H(X) + D(p_X \| p_Z) + 1,\; H(Y) + D(p_Y \| p_Z) + 1,\; M_{\min} + \min_{f_E \in \mathcal{H}_Z} \rho(A, G) \right).$$

The previous analysis had assumed n = 1. We may increase the block length and improve performance.

Theorem 1: Consider a source p(X,Y) with associated (unlabeled) weighted adjacency graph A and
an edit distance d with associated graph G. For any n, consider the set of triples (K, L, M) that are
computed, by allowing an arbitrary choice of the memoryless random variable $p(Z^n)$, as follows:

$$K = H(X) + D(p_X \| p_Z) + \frac{1}{n},$$
$$L = H(Y) + D(p_Y \| p_Z) + \frac{1}{n},$$
$$M = \frac{1}{n} \Pr[X_1^n \neq Y_1^n] + \frac{1}{n} \min_{f_E \in \mathcal{H}_{Z^n}} \rho\big(A = (\mathcal{W}^n, E(A), f_E),\, G\big).$$

Then this set of triples lies in the variable-length achievable region and is achievable instantaneously.
Proof: A non-degenerate random variable $Z^n$ is fixed. There is a family of instantaneous lossless
codes (with Δ = 0) that corresponds to this random variable, denoted $\{(f_E, f_D)\} = \mathcal{H}_{Z^n}$, through the
McMillan sum. By the results in [26], any of these codes achieve rates $K \leq H(X) + D(p_X \| p_Z) + \frac{1}{n}$
and $L \leq H(Y) + D(p_Y \| p_Z) + \frac{1}{n}$. Moreover, by the graph embedding construction, a code $(f_E, f_D)$
achieves $M = \frac{1}{n}\Pr[X_1^n \neq Y_1^n] + \frac{1}{n}\rho(A = (\mathcal{W}^n, E(A), f_E), G)$. Since all codes in $\mathcal{H}_{Z^n}$ have the same
rate performance, a code in the family that minimizes $\rho$ may be chosen. ■

The above theorem states that error-tolerant subgraph isomorphism implies achievable malleability. The
choice of the auxiliary random variable Z is open to optimization. If minimal rates are desired, then $p_Z$
should be on the geodesic connecting $p_X$ and $p_Y$. If Z is not on the geodesic, then there is some rate
loss, but perhaps there can be some malleability gains.
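
The rate side of this trade-off is simple to evaluate. The sketch below computes an entropy plus a relative-entropy penalty plus a 1/n term, which is the form of the rate expressions in Theorem 1 as read here. The example distributions and the function names are assumptions chosen for illustration.

```python
from math import log2


def entropy(p):
    """Shannon entropy in bits."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rate_bound(p, q, n):
    """Entropy plus mismatch penalty plus the 1/n block-coding overhead."""
    return entropy(p) + kl_divergence(p, q) + 1 / n


p_x = [1 / 2, 1 / 4, 1 / 8, 1 / 8]   # assumed source distribution
p_z_matched = p_x                    # code designed for the true source
p_z_uniform = [1 / 4] * 4            # code designed for a uniform source

print(rate_bound(p_x, p_z_matched, n=4), rate_bound(p_x, p_z_uniform, n=4))
```

Designing the code for a mismatched Z costs exactly the relative entropy in rate, which is the price one might pay in exchange for a labeling with a smaller subgraph distance.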

Note that when p(y|x) is a stationary editing process, there is the possibility of the simple lower
bounds being tight to this achievable region.

Corollary 2: Consider a source as given above in Theorem 1. If p(y|x) is stationary, p(x) = p(y) is
$|\mathcal{V}|$-adic, and there is a Huffman-labeled A for p(x) = p(y) that is an isometric subgraph of G, then the
block length n lower bound $(H(X), H(Y), \frac{1}{n}\Pr[X_1^n \neq Y_1^n])$ is tight to this achievable region for every
n, and in particular to $(H(X), H(Y), 0)$ for large n.

B. Block Coding 

Now we turn our attention to the block-coding palimpsest problem. For block coding we use a joint typicality
graph rather than the weighted adjacency graph used for variable-length coding. Additionally we focus on binary block
codes under Hamming edit distance, so we are concerned only with hypercubes rather than general edit
distance graphs.

We can use graph-theoretic ideas to formally state an achievability result for the block coding palimpsest
problem. As shown in the constructive examples above, there are schemes for which an improvement on
M may be achieved by increasing L. However, the resulting compression of $Y_1^n$ is not unique, and thus
is not optimal. We wish to expurgate the redundant representations of $Y_1^n$ as efficiently as possible, with
the aid of a graph. However, in doing so, we have to also consider the representations and how they are
related to one another. First we review some standard typicality arguments (from [58]) and then define
a graph from typical sets.

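Definition 14: The strongly typical set $T^n_{[X]\delta}$ with respect to p(x) is

$$T^n_{[X]\delta} = \left\{ x_1^n \in \mathcal{W}^n \;\middle|\; \sum_{x} \left| \frac{1}{n} N(x; x_1^n) - p(x) \right| \leq \delta \right\},$$

where $N(x; x_1^n)$ is the number of occurrences of $x$ in $x_1^n$ and $\delta > 0$.

Definition 15: The strongly jointly typical set $T^n_{[XY]\delta}$ with respect to p(x, y) is

$$T^n_{[XY]\delta} = \left\{ (x_1^n, y_1^n) \in \mathcal{W}^n \times \mathcal{W}^n \;\middle|\; \sum_{x} \sum_{y} \left| \frac{1}{n} N(x, y; x_1^n, y_1^n) - p(x, y) \right| \leq \delta \right\}.$$

Definition 16: For any $x_1^n \in T^n_{[X]\delta}$, define a strongly conditionally typical set

$$T^n_{[Y|X]\delta}(x_1^n) = \left\{ y_1^n \in T^n_{[Y]\delta} \;\middle|\; (x_1^n, y_1^n) \in T^n_{[XY]\delta} \right\}.$$

Definition 17: Let the connected strongly typical set be

$$S^n_{[X]\delta} = \left\{ x_1^n \in T^n_{[X]\delta} \;\middle|\; T^n_{[Y|X]\delta}(x_1^n) \text{ nonempty} \right\}.$$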
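
Strong typicality is a condition on empirical frequencies, so membership is straightforward to test. The sketch below is a minimal check, assuming the usual textbook form of the definition: the summed absolute deviation of relative frequencies from the distribution is at most δ, and no symbol of zero probability occurs. The function names and the toy distributions are illustrative assumptions.

```python
from collections import Counter


def is_strongly_typical(seq, p, delta):
    """True if the empirical frequencies of seq are within delta of p in total."""
    n = len(seq)
    counts = Counter(seq)
    # Symbols outside the support of p are not allowed to occur at all.
    if any(p.get(x, 0) == 0 for x in counts):
        return False
    deviation = sum(abs(counts.get(x, 0) / n - px) for x, px in p.items())
    return deviation <= delta


def is_strongly_jointly_typical(xs, ys, pxy, delta):
    """Joint typicality is typicality of the sequence of symbol pairs."""
    return is_strongly_typical(list(zip(xs, ys)), pxy, delta)


p = {"a": 0.5, "b": 0.5}
pxy = {("a", "a"): 0.5, ("b", "b"): 0.5}   # toy joint law with Y equal to X

print(is_strongly_typical("aabb", p, 0.1),
      is_strongly_typical("aaab", p, 0.1),
      is_strongly_jointly_typical("aabb", "aabb", pxy, 0.1),
      is_strongly_jointly_typical("aabb", "abab", pxy, 0.1))
```

Two sequences can each be marginally typical and still fail joint typicality, which is the situation that edges of the joint typicality graph are meant to exclude.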

Now that we have definitions of typical sets, we put forth some lemmas. 
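Lemma 1 (Strong AEP): Let $\eta$ be a small positive number such that $\eta \to 0$ as $\delta \to 0$. Then for
sufficiently large n,

$$\left| T^n_{[X]\delta} \right| \leq 2^{n(H(X)+\eta)}.$$

Proof: See [58, Theorem 5.2]. ■

Lemma 2 (Strong JAEP): Let $\lambda$ be a small positive number such that $\lambda \to 0$ as $\delta \to 0$. Then for
sufficiently large n,

$$\Pr\left[ (X_1^n, Y_1^n) \in T^n_{[XY]\delta} \right] > 1 - \delta$$

and

$$(1-\delta)\, 2^{n(H(X,Y)-\lambda)} \leq \left| T^n_{[XY]\delta} \right| \leq 2^{n(H(X,Y)+\lambda)}.$$

Proof: See [58, Theorem 5.8]. ■

Lemma 3: If $\delta(n)$ satisfies the following conditions, then Lemma 2 remains valid:

$$\delta(n) \to 0 \quad \text{and} \quad \sqrt{n}\,\delta(n) \to \infty \quad \text{as } n \to \infty.$$

Proof: See [59, (2.9) on p. 34]. ■

Lemma 4: If $\left| T^n_{[Y|X]\delta}(x_1^n) \right| \geq 1$, then

$$2^{n(H(Y|X)-\nu)} \leq \left| T^n_{[Y|X]\delta}(x_1^n) \right| \leq 2^{n(H(Y|X)+\nu)},$$

where $\nu \to 0$ as $n \to \infty$ and $\delta \to 0$.

Proof: See [58, Theorem 5.9]. ■

Lemma 5:

$$(1-\delta)\, 2^{n(H(X)-\psi)} \leq \left| S^n_{[X]\delta} \right|,$$

where $\psi \to 0$ as $n \to \infty$ and $\delta \to 0$. Also, for any $\delta > 0$,

$$\Pr\left[ X_1^n \in S^n_{[X]\delta} \right] > 1 - \delta$$

for n sufficiently large.

Proof: See [58, Corollary 5.11], Lemma 3, and [58, Proposition 5.12]. ■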
For the bivariate distribution p(x,y), define a square matrix called the strong joint typicality matrix
$A^n_{[XY]}$ as follows. There is one row (and column) for each sequence in $S^n_{[X]\delta} \cup S^n_{[Y]\delta}$. The entry with
row corresponding to $x_1^n$ and column corresponding to $y_1^n$ receives a one if $(x_1^n, y_1^n)$ is strongly jointly
typical and zero otherwise.

1) Stationary Editing: Now let us restrict ourselves to the class of bivariate distributions with equal
marginals:

$$\mathcal{P} = \{ p(x,y) \mid p(x) = p(y) \},$$

which is the class of distributions with stationary editing. In this class, we avoid the mismatch redundancy
and also reduce the number of performance parameters from 3 to 2. After this restriction, it is clear that
the x-typical set and the y-typical set coincide. Moreover, H(X) = H(Y) and H(Y|X) = H(X|Y) =
H(X) − I(X;Y). Thus it follows that asymptotically, $A^n_{[XY]}$ will be a square matrix with an approximately
equal number of ones in all columns and in all rows. Think of $A^n_{[XY]}$ as the adjacency matrix of a graph,
where the vertices are sequences and edges connect sequences that are jointly typical with one another.

Proposition 3: Take $A^n_{[XY]}$ for some source in $\mathcal{P}$ as the adjacency matrix of a graph $\mathcal{G}^n$. The number
of vertices in the graph will satisfy

$$(1-\delta)\, 2^{n(H(X)-\psi)} \leq |V(\mathcal{G}^n)| \leq 2^{n(H(X)+\psi)},$$

where $\psi \to 0$ as $n \to \infty$ and $\delta \to 0$. The degree of each vertex, $\deg_v$, will concentrate as

$$2^{n(H(Y|X)-\nu)} \leq \deg_v \leq 2^{n(H(Y|X)+\nu)},$$

where $\nu \to 0$ as $n \to \infty$ and $\delta \to 0$.

Proof: Follows from the previous lemmas. ■

Having established the basic topology of the strongly typical set as asymptotically a $2^{nH(Y|X)}$-regular
graph on $2^{nH(X)}$ vertices, we return to the coding problem. Using graph embedding ideas yields a theorem
on block palimpsest achievability.

Theorem 2: For a source $p(x, y) \in \mathcal{P}$ and the Hamming edit distance, a triple $(K, K, M = M_{\min})$ is
achievable if $\mathcal{G}^n \rightsquigarrow H_{nK}$.

Proof: To achieve $M_{\min}$, we need to assign binary codewords to each of the $2^{nH(X)}$ vertices, such
that the Hamming distance between the codeword of a vertex and the codewords of any of its neighbors is
1. Using the binary reflected Gray code of length nK and the hypercube that it induces, the construction
reduces to finding an embedding of $\mathcal{G}^n$ into the hypercube of size nK, denoted $H_{nK}$. Thus a sufficient
condition for block-code achievability, while requiring $M = M_{\min}$, is $\mathcal{G}^n \rightsquigarrow H_{nK}$. ■

Using this result, we argue that a linear increase in malleability is at exponential cost in code length.
By a simple counting argument, we present a condition for embeddability.

Theorem 3: For a source $p(x, y) \in \mathcal{P}$ and the Hamming edit distance, asymptotically, if $\mathcal{G}^n \rightsquigarrow H_{nK}$
then

$$nK \geq \max\left( nH(X),\; 2^{nH(Y|X)} \right). \qquad (5)$$

Proof: The hypercube $H_{nK}$ is an nK-regular graph with $2^{nK}$ vertices. As a minimal condition for
embeddability, the number of vertices in the hypercube must be greater than or equal to the number
of vertices in the graph to be embedded, i.e. $nK \geq n(H(X) + \psi)$. As another minimal condition for
embeddability, the degree of the hypercube must be greater than or equal to the maximal degree of the
graph to be embedded, so $nK \geq 2^{n(H(Y|X)+\nu)}$. Combining the two conditions and letting $\psi \to 0$ and
$\nu \to 0$ as $n \to \infty$ yields the desired result. ■
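
To get a feel for the two terms in (5), the sketch below evaluates the bound for given entropies. It is only a numerical illustration: the function names are hypothetical, and the example of a uniform bit observed through a binary symmetric perturbation with flip probability q is an assumed stationary editing process, not one taken from the text.

```python
from math import log2


def binary_entropy(q):
    """Binary entropy function in bits."""
    if q in (0, 1):
        return 0.0
    return -q * log2(q) - (1 - q) * log2(1 - q)


def min_code_length(n, h_x, h_y_given_x):
    """Right-hand side of (5): max(n H(X), 2^(n H(Y|X)))."""
    return max(n * h_x, 2 ** (n * h_y_given_x))


# Assumed example: X is a uniform bit and Y flips it with probability q.
for q in (0.01, 0.11):
    h_cond = binary_entropy(q)
    print(q, [round(min_code_length(n, 1.0, h_cond), 1) for n in (10, 20, 40)])
```

For weakly perturbed sources the vertex-count term nH(X) dominates at short block lengths, while the degree term grows exponentially in n and eventually takes over.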

This theorem is one of our main results. It should be noted that even if we allowed some asymptotically
small slack in breaking some edges to perform embedding, i.e. we solved an error-tolerant subgraph
isomorphism problem with error tolerance $\epsilon$, this would not help, since we would need to break a
constant fraction of edges in $\mathcal{G}^n$ to reduce the maximal degree. In particular, since each of the nH(X)
vertices in $\mathcal{G}^n$ asymptotically has the same degree, to reduce the maximal degree even by one would
require breaking $\epsilon \geq nH(X)$ edges. Clearly $\epsilon \not\to 0$ as $n \to \infty$.

This result can be interpreted as follows. When using binary codes that achieve the minimal malleability
parameter, the length of the code must be at least $\max(nH(X), 2^{nH(Y|X)})$. If $2^{nH(Y|X)}$ is much
greater than nH(X), i.e. the two versions are not particularly well correlated, this implies that to achieve
minimal malleability requires a significant length expansion of the codewords over the entropy bound.
Taking this to an extreme, suppose that X and Y are independent. Then $2^{nH(Y|X)} = 2^{nH(Y)}$, and an
exponential expansion is required, just as in the universal PPM scheme of Section IV-D.

If we want to understand the embeddability requirements further, we would need to understand the
topology of $\mathcal{G}^n$ further. Just knowing that it is asymptotically regular does not seem to be enough. Several
properties that are equivalent to exact hypercube embeddability are given in the footnote.† Of course we
can break some small fraction of edges in the graph to satisfy the embeddability conditions as long
as $\epsilon \to 0$ as $n \to \infty$. If we no longer require that nM be the minimal possible, then we are back to
the same kind of error-tolerant subgraph isomorphism formulation given for variable length coding in
the previous section. The only change in the characterization of the achievable region is that rather than
restricting the encoder to be the Huffman code of an auxiliary random variable Z, here one would need
to test the error-tolerant subgraph isomorphism functional over all permutations of labelings.

2) General Editing: If we remove our restriction of $p(x, y) \in \mathcal{P}$, then we can create $A^n_{[XY]}$ as before.
While the resulting graph would not be asymptotically regular, the basic result on paying an exponential
rate penalty will still hold.

The space $S^n = S^n_{[X]\delta} \cup S^n_{[Y]\delta}$ with the corresponding path metric $d_A$ induced by $A^n_{[XY]}$ is a metric
space. Hypercubes with their natural path metric, $d_G$, are also a metric space. Rather than requiring
absolutely minimal nM, it can be noted that M is asymptotically zero when the Lipschitz constant
associated with the mapping between the source space and the representation space has nice properties
in n.

Definition 18: A mapping $f$ from the metric space $(\mathcal{S}^n, d_{A^n})$ to the metric space $(\mathcal{V}^{nK}, d_{G^n})$ is called 
Lipschitz continuous if 

$$d_{G^n}(f(x_1), f(x_2)) \le C\, d_{A^n}(x_1, x_2)$$

for some constant $C$ and for all $x_1, x_2 \in \mathcal{S}^n$. The smallest such $C$ is the Lipschitz constant: 

$$\operatorname{Lip}[f] = \sup_{x_1 \neq x_2 \in \mathcal{S}^n} \frac{d_{G^n}(f(x_1), f(x_2))}{d_{A^n}(x_1, x_2)}.$$

Footnote: There are several characterizations of hypercube-embeddable graphs in the metric theory of graphs [60], [61]. For a bipartite 
connected graph $G = (V, E)$ the following statements are equivalent: 

• $G$ can be isometrically embedded into a hypercube. 

• The set $G(a,b) = \{x \in V \mid d_G(x,a) < d_G(x,b)\}$ is convex for each edge $(a, b)$ of $G$. A subset $U \subseteq V$ is convex if 
it is closed under taking shortest paths. 

• $G$ is an $\ell_1$ graph, i.e. the path metric $d_G$ is isometrically embeddable in the space $\ell_1$. 

• The path metric $d_G$ satisfies the pentagonal inequality: 

$$d_G(v_1,v_2) + d_G(v_1,v_3) + d_G(v_2,v_3) + d_G(v_4,v_5) - \sum_{h=1,2,3;\ k=4,5} d_G(v_h,v_k) \le 0$$

for all nodes $v_1, \ldots, v_5 \in V$. 

• The distance matrix of $G$ has exactly one positive eigenvalue. 

Further, a graph is said to be distance regular if there exist integers $b_m, c_m$ ($m \ge 0$) such that for any two nodes $i, j \in V(G)$ 
at distance $d_G(i,j) = m$ there are exactly $c_m$ nodes at distance 1 from $i$ and distance $m - 1$ from $j$, and there are $b_m$ nodes 
at distance 1 from $i$ and distance $m + 1$ from $j$. The distance-regular graphs that are hypercube embeddable are completely 
classified: the hypercubes, the even circuits, and the double-odd graphs. 

The Lipschitz constant is also called the dilation of an embedding, since it is the maximum amount 
that any edge of the embedded graph is stretched as it is replaced by a path in $H_{nK}$ [62], [63]. A related quantity is the 
Lipschitz constant of the inverse mapping, called the contraction: 

$$\operatorname{Lip}[f^{-1}] = \sup_{x_1 \neq x_2 \in \mathcal{S}^n} \frac{d_{A^n}(x_1, x_2)}{d_{G^n}(f(x_1), f(x_2))}.$$

The product of the dilation and contraction, $\operatorname{Lip}[f]\operatorname{Lip}[f^{-1}]$, is called the distortion. Another property of 
metric embeddings is the expansion, which is the ratio of the sizes of the two finite metric spaces, 

$$\operatorname{expan}[f] = \frac{|\mathcal{V}^{nK}|}{|\mathcal{S}^n|}.$$
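The four embedding quantities above can be computed directly for small examples. The following is a minimal sketch, assuming a guest graph given as an adjacency dictionary and an injective labeling `f` of its nodes by equal-length binary strings (hypercube vertices); the helper names `graph_distances`, `hamming`, and `embedding_stats` are illustrative and do not come from the paper.

```python
from collections import deque
from itertools import combinations


def graph_distances(adj):
    """All-pairs path-metric distances of a connected graph, by BFS from every node."""
    dist = {}
    for src in adj:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        dist[src] = seen
    return dist


def hamming(u, v):
    """Path metric of the hypercube: number of differing coordinates."""
    return sum(a != b for a, b in zip(u, v))


def embedding_stats(adj, f):
    """Return (dilation, contraction, distortion, expansion) of the injective map f."""
    dist = graph_distances(adj)
    pairs = list(combinations(adj, 2))
    dilation = max(hamming(f[a], f[b]) / dist[a][b] for a, b in pairs)
    contraction = max(dist[a][b] / hamming(f[a], f[b]) for a, b in pairs)
    host_dim = len(next(iter(f.values())))
    expansion = 2 ** host_dim / len(adj)  # host size divided by guest size
    return dilation, contraction, dilation * contraction, expansion


# Guest graph: the 4-cycle 0-1-2-3-0.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

# Gray-code labels: an isometric embedding into the 2-cube.
gray = {0: "00", 1: "01", 2: "11", 3: "10"}
# One-hot (pulse-position style) labels in the 4-cube.
one_hot = {0: "1000", 1: "0100", 2: "0010", 3: "0001"}

gray_stats = embedding_stats(cycle, gray)
one_hot_stats = embedding_stats(cycle, one_hot)
print(gray_stats)     # (1.0, 1.0, 1.0, 1.0)
print(one_hot_stats)  # (2.0, 1.0, 2.0, 4.0)
```

In this toy case the Gray-code labeling is isometric with no expansion, while the one-hot labeling keeps the dilation at 2 only by using a host four times larger than the guest.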

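We can bound the malleability $M$ for a coding scheme that only represents sequences in $\mathcal{S}^n$ as follows. 

Theorem 4: Let the Lipschitz constant be as defined. Then for a coding scheme $f_E$ that only represents 
sequences in $\mathcal{S}^n$, the union of the marginally typical sets of $X$ and $Y$, we have that 

$$M \le \frac{\operatorname{Lip}[f_E]}{n}\left(1 + \delta \operatorname{diam}(\mathcal{G}^n)\right).$$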
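Proof: The proof is given as follows: 

$$\begin{aligned}
M &= \frac{1}{n} E\left[d_{G^n}(f_E(X^n), f_E(Y^n))\right] \\
&\overset{(a)}{\le} \frac{\operatorname{Lip}[f_E]}{n} E\left[d_{A^n}(X^n, Y^n)\right] \\
&\overset{(b)}{\le} \frac{\operatorname{Lip}[f_E]}{n} \left(1 + \Pr\left[(X^n, Y^n) \text{ not jointly typical}\right] \max_{x_1^n \in \mathcal{S}^n,\, y_1^n \in \mathcal{S}^n} d_{A^n}(x_1^n, y_1^n)\right) \\
&\overset{(c)}{\le} \frac{\operatorname{Lip}[f_E]}{n} \left(1 + \delta \max_{x_1^n \in \mathcal{S}^n,\, y_1^n \in \mathcal{S}^n} d_{A^n}(x_1^n, y_1^n)\right) \\
&= \frac{\operatorname{Lip}[f_E]}{n} \left\{1 + \delta \operatorname{diam}(\mathcal{G}^n)\right\},
\end{aligned}$$

where step (a) is by definition of the Lipschitz constant; step (b) follows from the definition of graph 
distance and the consistency of strong typicality (Theorem 5.7 in the cited reference), since jointly typical pairs are 
adjacent and all other pairs are at most the maximum distance apart; and step (c) is from bounding the probability 
that $(X^n, Y^n)$ is not jointly typical by $\delta$, from Lemma 2. Note that the $\delta$ bound used in step (c) for the probability 
of sequence pairs that are both marginally typical but not jointly typical is actually the probability of 
all non-jointly typical pairs and is therefore loose. ■ 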
Computing Lipschitz constants is usually difficult or impossible. There are, however, methods from 
theoretical computer science for bounding Lipschitz constants (or dilation) for embeddings [62], [63]. For 
a "host" graph $\mathcal{H}$ and a "guest" graph $\mathcal{G}$, a basic counting argument reminiscent of Theorem 3 shows 
that the dilation for any embedding of $\mathcal{G}$ into $\mathcal{H}$ must satisfy 

$$\operatorname{Lip}[f_E] \ge \left\lceil \frac{\log(d_{\mathcal{G}} - 1)}{\log(d_{\mathcal{H}} - 1)} \right\rceil,$$

where $d_{\mathcal{G}}$ and $d_{\mathcal{H}}$ are the respective maximum degrees [62, Prop. 1.5.2]. When the guest graph is the 
joint typicality graph and the host graph is the hypercube $H_{nK}$, this implies that 

$$\operatorname{Lip}[f_E] \ge \left\lceil \frac{\log\left(2^{n(H(Y|X)+\eta)} - 1\right)}{\log(nK - 1)} \right\rceil \approx \frac{nH(Y|X)}{\log nK}.$$
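The counting idea behind degree-based dilation bounds can be illustrated for a hypercube host. The sketch below is not the cited proposition itself; it assumes an injective embedding and uses the fact that all neighbours of a guest node must land on distinct hypercube vertices within Hamming distance equal to the dilation of that node's image. The function names are hypothetical.

```python
from math import comb


def hypercube_ball_size(dim, radius):
    """Number of hypercube vertices at Hamming distance 1..radius from a fixed vertex."""
    return sum(comb(dim, i) for i in range(1, radius + 1))


def min_dilation_by_counting(guest_max_degree, dim):
    """Smallest radius whose ball can hold the images of all neighbours of a
    maximum-degree guest node; any injective embedding needs at least this dilation."""
    if guest_max_degree > 2 ** dim - 1:
        raise ValueError("host hypercube has too few vertices")
    radius = 1
    while hypercube_ball_size(dim, radius) < guest_max_degree:
        radius += 1
    return radius


# In a 10-dimensional hypercube, a guest node of degree 10 can be placed with
# dilation 1, but a node of degree 11 already forces dilation at least 2.
print(min_dilation_by_counting(10, 10))  # 1
print(min_dilation_by_counting(11, 10))  # 2
print(min_dilation_by_counting(56, 10))  # 3
```

When the guest degree grows exponentially in $n$ while the hypercube dimension grows only linearly, the required radius grows with $n$, which is the qualitative behaviour the bound expresses.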
Another typical result arises when it is fixed that both graphs have $m$ vertices (expansion is 1). The 
Lipschitz constant bound is in terms of the bisection width $W_{\mathcal{G}}$ and the recursive edge-bisector function 
$R_{\mathcal{H}}(r)$. The dilation $\operatorname{Lip}[f_E]$ of any embedding $f_E$ of $\mathcal{G}$ into $\mathcal{H}$ must satisfy 

$$\operatorname{Lip}[f_E] \ge \left(\frac{1}{\log d_{\mathcal{H}}}\right) \log \frac{W_{\mathcal{G}}}{d_{\mathcal{G}} R_{\mathcal{H}}(m)} - 1.$$

The bisection width of a graph is the size of the smallest cut-set that breaks the graph into two subgraphs 
of equal sizes (to within rounding) [62, Prop. 2.3.6]. Using such a result is difficult since the bisection 
width of the joint typicality graph is not known. For the case when $\mathcal{H} = H_{\lceil nH(X) \rceil}$, a simplified version 
reduces to 

$$\operatorname{Lip}[f_E] \ge \frac{\lceil \log m \rceil}{\operatorname{diam}(\mathcal{G}^n)} \approx \frac{nH(X)}{\operatorname{diam}(\mathcal{G}^n)},$$

where $\operatorname{diam}(\mathcal{G}^n)$ is the same graph diameter we had seen before [63]. If the dilation is to be no greater than 
2, the PPM scheme we have described previously may be reinterpreted in a graph embedding framework 
and seen to achieve $\operatorname{Lip}[f_E] \le 2$, but the price is exponential expansion. 
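Returning to Theorem 4, as noted in Lemma 3, $\delta$ can be taken as 

$$\delta = n^{-\frac{1}{2}+\omega}$$

for some fixed $\omega > 0$. The diameter of the hypercube $H_{nK}$ is clearly $nK$. Combining this with the 
contraction provides a bound on the diameter of $\mathcal{G}^n$: 

$$\operatorname{diam}(\mathcal{G}^n) \le \operatorname{Lip}[f_E^{-1}]\, nK.$$

Thus one can further bound the expression in Theorem 4 as 

$$\begin{aligned}
M &\le \frac{\operatorname{Lip}[f_E]}{n}\left(1 + \delta \operatorname{diam}(\mathcal{G}^n)\right) \\
&\le \frac{\operatorname{Lip}[f_E]}{n}\left(1 + n^{-\frac{1}{2}+\omega} \operatorname{Lip}[f_E^{-1}]\, nK\right) \\
&= \frac{\operatorname{Lip}[f_E]}{n} + n^{-\frac{1}{2}+\omega} K \operatorname{Lip}[f_E] \operatorname{Lip}[f_E^{-1}].
\end{aligned}$$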



This yields the following proposition. 

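Proposition 4: The malleability $M$ is asymptotically bounded above by: 

$$\limsup_{n \to \infty} M \le \limsup_{n \to \infty} \left( \frac{\operatorname{Lip}[f_E]}{n} + n^{-\frac{1}{2}+\omega} K \operatorname{Lip}[f_E] \operatorname{Lip}[f_E^{-1}] \right)$$

for any fixed $\omega > 0$. 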

The quantity $M$ is essentially bounded by $n^{-\frac{1}{2}+\omega} K \operatorname{Lip}[f_E] \operatorname{Lip}[f_E^{-1}]$ since the second term should 
dominate the first. An alternate expression for $K$ is $K = \frac{1}{n} \log_2 \operatorname{expan}[f_E] + H(X)$, which is fixed. If 
$\operatorname{Lip}[f_E] \operatorname{Lip}[f_E^{-1}]$ is $o(\sqrt{n})$ and $\operatorname{Lip}[f_E]$ is $o(n)$ for the sequence of encoders $f_E$, $M$ will go to zero 
asymptotically in $n$. Due to the bounding methods that were used, it is not at all clear whether this 
Lipschitz bound on malleability is tight, and one might suspect that it is not. A slightly different branch 
of theoretical computer science deals with bounding the distortion of mappings [64], [65]; however, it is 
not clear how to apply these results to the palimpsest problem. 
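The asymptotic condition can be checked numerically. The sketch below assumes the bound takes the form $\operatorname{Lip}[f_E]/n + n^{-1/2+\omega} K \operatorname{Lip}[f_E]\operatorname{Lip}[f_E^{-1}]$; the function name and the growth rates chosen for the dilation and contraction are hypothetical, picked only to show the two regimes.

```python
def malleability_bound(n, K, lip, lip_inv, omega=0.1):
    """Evaluate Lip/n + n**(-1/2 + omega) * K * Lip * Lip_inv (assumed form of the bound)."""
    return lip / n + n ** (-0.5 + omega) * K * lip * lip_inv


block_lengths = (10**2, 10**4, 10**6)

# Hypothetical case 1: constant dilation, contraction growing like n**0.25.
# The distortion grows slower than n**(1/2 - omega), so the bound decays.
slow = [malleability_bound(n, K=1.0, lip=2.0, lip_inv=n ** 0.25) for n in block_lengths]

# Hypothetical case 2: contraction growing linearly in n, so the bound diverges
# and says nothing useful about M.
fast = [malleability_bound(n, K=1.0, lip=2.0, lip_inv=float(n)) for n in block_lengths]

print(slow)
print(fast)
```

Only the first regime yields a vanishing bound; in the second the bound is vacuous, which does not by itself show that the malleability is large.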

VIII. Discussion and Conclusions 

We have formulated an information theoretic problem motivated by applications in information storage 
where a compressed stored document must often be updated and there are costs associated with writing on 
the storage medium. That there are always editing costs in overwriting rewritable media is a fundamental 
fact of thermodynamics and follows from Landauer's principle [66]: since discarding information results 
in a dissipation of energy, overwriting causes an inextricable loss of energy. 

Both the compressed palimpsest problem considered here and a distinct problem with a similar 
motivation presented in a companion paper [4] exhibit a fundamental trade-off between compression 
efficiency and the costs incurred when synchronizing between two versions of a source code. The 
palimpsest problem is concerned with random access editing, where changing nearby or greatly separated 
symbols in the compressed representation has the same cost. The "cut-and-paste" formulation of [4] is 
concerned with editing large subsequences, as would be appropriate when there is a cost associated with 
communicating the positions of edits. 

The basic result is that unless the two versions of the source are either very strongly correlated or 
have a deterministically common part, if rates close to entropy are required for both sources, then a large 
malleability cost will have to be paid. Similarly, if small malleability is required, a very large rate penalty 
will be paid. There is a fundamental trade-off between the quantities. 

For our compressed palimpsest problem, we found that if minimal malleability costs are desired, then 
a rate penalty that is exponential in the conditional entropy of the editing process must be paid. That is, 
unless the two versions of the source are very strongly correlated (conditional entropy logarithmic in block 
length), rate exponentially larger than entropy is needed. A universal scheme for minimal malleability is 
given by a pulse position modulation method. Thus, if we require malleability $M = O(1/n)$, then rates 
$K$ and $L$ must be $\Omega\left(\frac{1}{n} 2^{n}\right)$. 

One may be tempted to try to cast the block palimpsest problem in terms of error-correcting codes, 
where the quality metric is the block Hamming distance. The Hamming distance does not care how 
two letters differ; it only cares whether they are different. In a sense, it is an $\ell_0$ distance. This gives 
rise to error-correcting codes that try to maximize the minimum distance between two codewords in the 
codebook. In malleable coding, we care not just about whether a modified codeword is inside or outside 
the minimum distance decoding region for the original codeword, but how far, basically treating the space 
with a symbol edit distance, which may be $\ell_1$. 

Appendix A 
$(\mathcal{V}^*, d)$ Is a Finite Metric Space 

A metric must satisfy non-negativity, equality, symmetry, and the triangle inequality. These properties 
are verified for any edit distance with edit operation $R$ as follows. 

• non-negativity: follows since the edit distance is a counting measure. 

• equality: follows by definition, since the distance is zero if and only if $a = b$. 

• symmetry: If $d(a, b) = n$, then it follows there is a sequence of $n - 1$ intermediate strings, $a_1, a_2, \ldots, a_{n-1}$, 
which along with $a_0 = a$ and $a_n = b$ satisfy $(a_i, a_{i+1}) \in R$. Since $R$ is a symmetric relation, it 
follows that $(a_{i+1}, a_i)$ is also in $R$, and so there is a backwards sequence $a_n, a_{n-1}, \ldots, a_0$ of the same 
length, giving $d(b, a) \le d(a, b)$. Exchanging the roles of $a$ and $b$ gives the reverse inequality, and so 
$d(a, b) = d(b, a)$ for all $a, b$. 

• triangle inequality: Suppose, to the contrary, that $d(a, b) + d(b, c) < d(a, c)$. Perform the editing operations of 
$d(a, b)$ followed by the operations of $d(b, c)$. This is a sequence of editing operations $(a_i, a_{i+1})$ that goes 
from $a$ to $c$ via $b$ in $d(a, b) + d(b, c)$ steps. Since $d(a, c)$ is the minimum number of steps from $a$ to $c$, this 
contradicts the initial assumption, hence $d(a, b) + d(b, c) \ge d(a, c)$. 
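The four properties can also be checked by brute force on a small instance. The sketch below assumes, purely for illustration, a binary alphabet, strings of length at most 3, and an edit relation $R$ consisting of single-symbol substitutions, insertions, and deletions; the edit distance is computed as the shortest chain of edits by breadth-first search. All function names are hypothetical.

```python
from collections import deque
from itertools import product


def neighbors(s, alphabet, max_len):
    """Strings related to s by one edit in R: substitute, insert, or delete one symbol."""
    out = set()
    for i in range(len(s)):
        out.add(s[:i] + s[i + 1:])              # deletion
        for c in alphabet:
            out.add(s[:i] + c + s[i + 1:])      # substitution
    if len(s) < max_len:
        for i in range(len(s) + 1):
            for c in alphabet:
                out.add(s[:i] + c + s[i:])      # insertion
    out.discard(s)
    return out


def edit_distances(alphabet="01", max_len=3):
    """All-pairs edit distance: length of the shortest chain of edits in R."""
    strings = ["".join(p) for k in range(max_len + 1)
               for p in product(alphabet, repeat=k)]
    dist = {}
    for src in strings:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in neighbors(u, alphabet, max_len):
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        dist[src] = seen
    return strings, dist


def is_metric(strings, d):
    """Check non-negativity, equality, symmetry, and the triangle inequality."""
    for a, b in product(strings, repeat=2):
        if d[a][b] < 0 or (d[a][b] == 0) != (a == b) or d[a][b] != d[b][a]:
            return False
    return all(d[a][b] + d[b][c] >= d[a][c]
               for a, b, c in product(strings, repeat=3))


strings, dist = edit_distances()
print(is_metric(strings, dist))
print(dist["01"]["10"], dist["000"]["111"])
```

The check succeeds because the relation used here is symmetric, which is exactly the hypothesis the symmetry argument relies on.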

Appendix B 
Proof of Proposition [T] 

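Since $G_1$ is a subgraph of $H_1$, $V(G_1) \subseteq V(H_1)$. Since $G_2$ is a subgraph of $H_2$, $V(G_2) \subseteq V(H_2)$. Then by elementary set 
operations, $V(G_1 \times G_2) = V(G_1) \times V(G_2) \subseteq V(H_1) \times V(H_2) = V(H_1 \times H_2)$. Since $G_1$ is a subgraph of $H_1$, $E(G_1) \subseteq 
E(H_1)$. Since $G_2$ is a subgraph of $H_2$, $E(G_2) \subseteq E(H_2)$. Consider an edge $(u, v) \in E(G_1 \times G_2)$. By definition of 
Cartesian product, it satisfies ($u_1 = v_1$ and $(u_2, v_2) \in E(G_2)$) or ($u_2 = v_2$ and $(u_1, v_1) \in E(G_1)$), but 
since $E(G_1) \subseteq E(H_1)$ and $E(G_2) \subseteq E(H_2)$, it also satisfies ($u_1 = v_1$ and $(u_2, v_2) \in E(H_2)$) or ($u_2 = v_2$ 
and $(u_1, v_1) \in E(H_1)$). Therefore $E(G_1 \times G_2) \subseteq E(H_1 \times H_2)$. Since $V(G_1 \times G_2) \subseteq V(H_1 \times H_2)$ and 
$E(G_1 \times G_2) \subseteq E(H_1 \times H_2)$, $G_1 \times G_2$ is a subgraph of $H_1 \times H_2$. 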
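The argument can be checked on a small example. The sketch below represents a graph as a pair (vertex set, set of two-element `frozenset` edges) and builds the Cartesian product directly from its definition; the particular graphs and the helper names are illustrative assumptions, not taken from the paper.

```python
from itertools import product


def graph(vertices, edge_list):
    """Build a simple undirected graph as (vertex set, set of frozenset edges)."""
    return set(vertices), {frozenset(e) for e in edge_list}


def cartesian_product(g, h):
    """Cartesian product: (u1,u2)~(v1,v2) iff one coordinate is equal and the other is an edge."""
    (vg, eg), (vh, eh) = g, h
    vertices = set(product(vg, vh))
    edges = set()
    for (u1, u2), (v1, v2) in product(vertices, repeat=2):
        if (u1 == v1 and frozenset((u2, v2)) in eh) or \
           (u2 == v2 and frozenset((u1, v1)) in eg):
            edges.add(frozenset(((u1, u2), (v1, v2))))
    return vertices, edges


def is_subgraph(g, h):
    """True if both the vertex set and the edge set of g are contained in those of h."""
    return g[0] <= h[0] and g[1] <= h[1]


G1 = graph("abc", [("a", "b"), ("b", "c")])              # path on three nodes
H1 = graph("abc", [("a", "b"), ("b", "c"), ("a", "c")])  # triangle containing G1
G2 = graph([0, 1], [(0, 1)])                             # single edge
H2 = graph([0, 1, 2], [(0, 1), (1, 2)])                  # path containing G2

G_prod = cartesian_product(G1, G2)
H_prod = cartesian_product(H1, H2)
print(is_subgraph(G1, H1), is_subgraph(G2, H2), is_subgraph(G_prod, H_prod))
```

The product of the path and the single edge has $3 \cdot 1 + 2 \cdot 2 = 7$ edges, all of which appear in the product of the larger graphs.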